2: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
5: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
0: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
6: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
7: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
3: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
4: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
2: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
0: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
7: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
0: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
0: START 2068236: Fri Nov 25 09:57:21 EET 2022
2: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
7: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
2: START 2068236: Fri Nov 25 09:57:21 EET 2022
7: START 2068236: Fri Nov 25 09:57:21 EET 2022
1: Model parameters: d_model 2304 ffw_size 9216 kv_size 128 n_heads 18 n_layers 32
6: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
6: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
6: START 2068236: Fri Nov 25 09:57:21 EET 2022
3: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
3: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
3: START 2068236: Fri Nov 25 09:57:21 EET 2022
2:
2:
2: ======================= ROCm System Management Interface =======================
2: ================================= Concise Info =================================
2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
2: 0 42.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 2 41.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 3 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 4 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 6 41.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 7 37.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: ================================================================================
2: ============================= End of ROCm SMI Log ==============================
0:
0:
0: ======================= ROCm System Management Interface =======================
0: ================================= Concise Info =================================
0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0: 0 51.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 2 46.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 3 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 4 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 5 51.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 6 43.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: ================================================================================
0: ============================= End of ROCm SMI Log ==============================
7:
7:
7: ======================= ROCm System Management Interface =======================
7: ================================= Concise Info =================================
7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
7: 0 49.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 2 45.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 3 40.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 4 45.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 6 43.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 7 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: ================================================================================
7: ============================= End of ROCm SMI Log ==============================
4: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
4: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
4: START 2068236: Fri Nov 25 09:57:21 EET 2022
6:
6:
6: ======================= ROCm System Management Interface =======================
6: ================================= Concise Info =================================
6: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
6: 0 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 2 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 4 41.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 6 38.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: ================================================================================
6: ============================= End of ROCm SMI Log ==============================
3:
3:
3: ======================= ROCm System Management Interface =======================
3: ================================= Concise Info =================================
3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
3: 0 40.0c 100.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 2 39.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 4 40.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 6 39.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: ================================================================================
3: ============================= End of ROCm SMI Log ==============================
5: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
5: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
5: START 2068236: Fri Nov 25 09:57:21 EET 2022
1: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 32 --hidden-size 2304 --num-attention-heads 18 --kv-channels 128 --ffn-hidden-size 9216 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 512 --train-samples 22_565_693 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-2b2 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 22_565_693 --lr-warmup-samples 225_657 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_2b2 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_2b2 --load checkpoints_2b2 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-
1: impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068236.json --zero-stage 0
1: START 2068236: Fri Nov 25 09:57:21 EET 2022
4:
4:
4: ======================= ROCm System Management Interface =======================
4: ================================= Concise Info =================================
4: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
4: 0 48.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 2 43.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 3 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 4 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 6 45.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 7 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: ================================================================================
4: ============================= End of ROCm SMI Log ==============================
5:
5:
5: ======================= ROCm System Management Interface =======================
5: ================================= Concise Info =================================
5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
5: 0 44.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 1 52.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 2 43.0c 99.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 3 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 4 40.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 6 44.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 7 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: ================================================================================
5: ============================= End of ROCm SMI Log ==============================
1:
1:
1: ======================= ROCm System Management Interface =======================
1: ================================= Concise Info =================================
1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1: 0 43.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 2 46.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 4 45.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 6 40.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: ================================================================================
1: ============================= End of ROCm SMI Log ==============================
0: Launching on nid005343 (0/8), master nid005343 port 9999, GPUs 8, CUDA: True
7: Launching on nid005354 (7/8), master nid005343 port 9999, GPUs 8, CUDA: True
2: Launching on nid005345 (2/8), master nid005343 port 9999, GPUs 8, CUDA: True
3: Launching on nid005346 (3/8), master nid005343 port 9999, GPUs 8, CUDA: True
6: Launching on nid005353 (6/8), master nid005343 port 9999, GPUs 8, CUDA: True
4: Launching on nid005350 (4/8), master nid005343 port 9999, GPUs 8, CUDA: True
5: Launching on nid005351 (5/8), master nid005343 port 9999, GPUs 8, CUDA: True
1: Launching on nid005344 (1/8), master nid005343 port 9999, GPUs 8, CUDA: True
0: using world size: 64, data-parallel-size: 64, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
0: accumulate and all-reduce gradients in fp32 for bfloat16 data type.
0: using torch.bfloat16 for parameters ...
0: ------------------------ arguments ------------------------
0: abort_on_unmet_fused_kernel_constraints ......... False
0: accumulate_allreduce_grads_in_fp32 .............. True
0: adam_beta1 ...................................... 0.9
0: adam_beta2 ...................................... 0.999
0: adam_eps ........................................ 1e-08
0: adlr_autoresume ................................. False
0: adlr_autoresume_interval ........................ 1000
0: apply_query_key_layer_scaling ................... True
0: apply_residual_connection_post_layernorm ........ False
0: attention_dropout ............................... 0.1
0: attention_softmax_in_fp32 ....................... False
0: bert_binary_head ................................ True
0: bert_load ....................................... None
0: bf16 ............................................ True
0: bias_dropout_fusion ............................. True
0: bias_gelu_fusion ................................ True
0: biencoder_projection_dim ........................ 0
0: biencoder_shared_query_context_model ............ False
0: block_data_path ................................. None
0: checkpoint_activations .......................... False
0: checkpoint_in_cpu ............................... False
0: checkpoint_num_layers ........................... 1
0: clip_grad ....................................... 1.0
0: codecarbon_dir .................................. None
0: consumed_train_samples .......................... 0
0: consumed_train_tokens ........................... 0
0: consumed_valid_samples .......................... 0
0: contigious_checkpointing ........................ False
0: cpu_optimizer ................................... False
0: cpu_torch_adam .................................. False
0: curriculum_learning ............................. False
0: data_impl ....................................... mmap
0: data_parallel_size .............................. 64
0: data_path ....................................... ['/scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document']
0: dataloader_type ................................. single
0: DDP_impl ........................................ local
0: decoder_seq_length .............................. None
0: deepscale ....................................... False
0: deepscale_config ................................ None
0: deepspeed ....................................... True
0: deepspeed_activation_checkpointing .............. False
0: deepspeed_config ................................ ds_configs/2068236.json
0: deepspeed_mpi ................................... False
0: distribute_checkpointed_activations ............. False
0: distributed_backend ............................. nccl
0: embed_layernorm ................................. False
0: embedding_path .................................. None
0: encoder_seq_length .............................. 2048
0: eod_mask_loss ................................... False
0: eval_interval ................................... 1000
0: eval_iters ...................................... 1
0: eval_only ....................................... None
0: evidence_data_path .............................. None
0: exit_duration_in_mins ........................... None
0: exit_interval ................................... None
0: ffn_hidden_size ................................. 9216
0: finetune ........................................ False
0: fp16 ............................................ False
0: fp16_lm_cross_entropy ........................... False
0: fp32_residual_connection ........................ False
0: gigaflos_no_embeds .............................. 0
0: global_batch_size ............................... 512
0: glu_activation .................................. None
0: hidden_dropout .................................. 0.1
0: hidden_size ..................................... 2304
0: hysteresis ...................................... 2
0: ict_head_size ................................... None
0: ict_load ........................................ None
0: img_dim ......................................... 224
0: indexer_batch_size .............................. 128
0: indexer_log_interval ............................ 1000
0: inference ....................................... False
0: init_method_std ................................. 0.02
0: init_method_xavier_uniform ...................... False
0: initial_loss_scale .............................. 4294967296
0: kill_switch_path ................................ kill-switch-2b2
0: kv_channels ..................................... 128
0: layer_norm_fusion ............................... True
0: layernorm_epsilon ............................... 1e-05
0: lazy_mpu_init ................................... None
0: load ............................................ checkpoints_2b2
0: local_rank ...................................... None
0: log_batch_size_to_tensorboard ................... True
0: log_interval .................................... 10
0: log_learning_rate_to_tensorboard ................ True
0: log_level ....................................... None
0: log_level_replica ............................... None
0: log_loss_scale_to_tensorboard ................... True
0: log_num_zeros_in_grad ........................... False
0: log_params_norm ................................. False
0: log_path ........................................ None
0: log_timers_to_tensorboard ....................... True
0: log_validation_ppl_to_tensorboard ............... True
0: loss_on_targets_only ............................ False
0: loss_scale ...................................... None
0: loss_scale_window ............................... 1000
0: lr .............................................. 0.0002
0: lr_decay_iters .................................. None
0: lr_decay_samples ................................ 22565693
0: lr_decay_style .................................. cosine
0: lr_decay_tokens ................................. None
0: lr_warmup_fraction .............................. None
0: lr_warmup_iters ................................. 0
0: lr_warmup_samples ............................... 225657
0: make_vocab_size_divisible_by .................... 128
0: mask_prob ....................................... 0.15
0: masked_softmax_fusion ........................... True
0: max_position_embeddings ......................... 2048
0: mean_noise_span_length .......................... None
0: memory_centric_tiled_linear ..................... False
0: merge_file ...................................... gpt2/merges.txt
0: micro_batch_size ................................ 2
0: min_loss_scale .................................. 1.0
0: min_lr .......................................... 2e-05
0: mmap_warmup ..................................... False
0: no_load_optim ................................... None
0: no_load_rng ..................................... None
0: no_save_optim ................................... None
0: no_save_rng ..................................... None
0: noise_density ................................... None
0: num_attention_heads ............................. 18
0: num_channels .................................... 3
0: num_classes ..................................... 1000
0: num_layers ...................................... 32
0: num_layers_per_virtual_pipeline_stage ........... None
0: num_workers ..................................... 2
0: onnx_safe ....................................... None
0: openai_gelu ..................................... False
0: optimizer ....................................... adam
0: optimizer_fusion ................................ True
0: override_lr_scheduler ........................... False
0: pad_vocab_size_to ............................... None
0: params_dtype .................................... torch.bfloat16
0: partition_activations ........................... False
0: patch_dim ....................................... 16
0: pipeline_model_parallel_size .................... 1
0: position_embedding_type ......................... PositionEmbeddingType.absolute
0: pp_partition_method ............................. None
0: profile_backward ................................ False
0: query_in_block_prob ............................. 0.1
0: rampup_batch_size ............................... None
0: rank ............................................ 0
0: remote_device ................................... none
0: reset_attention_mask ............................ False
0: reset_position_ids .............................. False
0: retriever_report_topk_accuracies ................ []
0: retriever_score_scaling ......................... False
0: retriever_seq_length ............................ 256
0: reweight_loss_based_on_position_frequency ....... False
0: sample_rate ..................................... 1.0
0: save ............................................ checkpoints_2b2
0: save_interval ................................... 1000
0: scatter_gather_tensors_in_pipeline .............. True
0: scattered_embeddings ............................ False
0: seed ............................................ 1234
0: seq_length ...................................... 2048
0: sgd_momentum .................................... 0.9
0: short_seq_prob .................................. 0.1
0: skip_train_iteration_range ...................... None
0: split ........................................... 949,50,1
0: split_transformers .............................. False
0: sync_tp_duplicated_parameters ................... False
0: synchronize_each_layer .......................... False
0: tensor_model_parallel_size ...................... 1
0: tensorboard_dir ................................. tensorboard_2b2
0: tensorboard_log_interval ........................ 1
0: tensorboard_queue_size .......................... 5
0: test_weighted_split_names ....................... None
0: test_weighted_split_paths ....................... None
0: test_weighted_split_paths_path .................. None
0: test_weighted_split_splits ...................... None
0: test_weighted_split_weights ..................... None
0: tile_factor ..................................... 1
0: titles_data_path ................................ None
0: tokenizer_name_or_path .......................... None
0: tokenizer_type .................................. GPT2BPETokenizer
0: train_iters ..................................... None
0: train_samples ................................... 22565693
0: train_tokens .................................... None
0: train_weighted_split_paths ...................... None
0: train_weighted_split_paths_path ................. None
0: universal_checkpoint ............................ False
0: use_bnb_optimizer ............................... False
0: use_checkpoint_lr_scheduler ..................... False
0: use_contiguous_buffers_in_ddp ................... True
0: use_cpu_initialization .......................... None
0: use_one_sent_docs ............................... False
0: use_pin_memory .................................. False
0: valid_num_workers ............................... 2
0: valid_weighted_split_names ...................... None
0: valid_weighted_split_paths ...................... None
0: valid_weighted_split_paths_path ................. None
0: valid_weighted_split_splits ..................... None
0: valid_weighted_split_weights .................... None
0: virtual_pipeline_model_parallel_size ............ None
0: vocab_extra_ids ................................. 0
0: vocab_file ...................................... gpt2/vocab.json
0: weight_decay .................................... 0.1
0: world_size ...................................... 64
0: zero_allgather_bucket_size ...................... 0.0
0: zero_contigious_gradients ....................... False
0: zero_reduce_bucket_size ......................... 0.0
0: zero_reduce_scatter ............................. False
0: zero_stage ...................................... 0
0: -------------------- end of arguments ---------------------
0: setting number of micro-batches to constant 4
0: > building GPT2BPETokenizer tokenizer ...
0: > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
0: DeepSpeed general environment info:
0: torch install path ............... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch']
0: torch version .................... 1.13.0+rocm5.2
0: torch cuda version ............... None
0: torch hip version ................ 5.2.21151-afdc89f8
0: nvcc version ..................... None
0: deepspeed install path ........... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed']
0: deepspeed info ................... 0.7.5, unknown, unknown
0: deepspeed wheel compiled w. ...... torch 1.13, hip 5.1
0: **** Git info for Megatron: git_hash=unknown git_branch=unknown ****
0: > initializing torch distributed ...
0: [2022-11-25 09:57:30,635] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
7: > setting tensorboard ...
0: > initializing tensor model parallel with size 1
0: > initializing pipeline model parallel with size 1
0: > setting random seeds to 1234 ...
0: > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
0: > compiling dataset index builder ...
0: make: Entering directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data'
0: make: Nothing to be done for 'default'.
0: make: Leaving directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data'
0: >>> done with dataset index builder. Compilation time: 0.052 seconds
0: > compiling and loading fused kernels ...
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.cpp [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.hip [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified]
0: Total number of unsupported CUDA function calls: 0
0:
0:
0: Total number of replaced kernel launches: 87
0: ninja: no work to do.
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.cpp [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_cuda.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.hip [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified]
0: Total number of unsupported CUDA function calls: 0
0:
0:
0: Total number of replaced kernel launches: 63
0: [1/1] c++ scaled_masked_softmax_hip.cuda.o scaled_masked_softmax_hip.o -shared -L/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/lib -lc10 -lc10_hip -ltorch_cpu -ltorch_hip -ltorch -ltorch_python -L/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib -lamdhip64 -o scaled_masked_softmax_cuda.so
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp [skipped, no changes]
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda_kernel.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_hip_kernel.hip [skipped, already hipified]
0:
/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 67 0: ninja: no work to do. 0: >>> done with compiling and loading fused kernels. Compilation time: 16.664 seconds 0: time to initialize megatron (seconds): -0.542 0: [after megatron is initialized] datetime: 2022-11-25 09:57:51 0: building GPT model ... 
0: [2022-11-25 09:57:51,521] [INFO] [utils.py:827:see_memory_usage] Before Building Model 0: [2022-11-25 09:57:51,522] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB 0: [2022-11-25 09:57:51,522] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.02 GB, percent = 8.1% 0: SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None 0: Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3, ProcessCoord(pipe=0, data=4, model=0): 4, ProcessCoord(pipe=0, data=5, model=0): 5, ProcessCoord(pipe=0, data=6, model=0): 6, ProcessCoord(pipe=0, data=7, model=0): 7, ProcessCoord(pipe=0, data=8, model=0): 8, ProcessCoord(pipe=0, data=9, model=0): 9, ProcessCoord(pipe=0, data=10, model=0): 10, ProcessCoord(pipe=0, data=11, model=0): 11, ProcessCoord(pipe=0, data=12, model=0): 12, ProcessCoord(pipe=0, data=13, model=0): 13, ProcessCoord(pipe=0, data=14, model=0): 14, ProcessCoord(pipe=0, data=15, model=0): 15, ProcessCoord(pipe=0, data=16, model=0): 16, ProcessCoord(pipe=0, data=17, model=0): 17, ProcessCoord(pipe=0, data=18, model=0): 18, ProcessCoord(pipe=0, data=19, model=0): 19, ProcessCoord(pipe=0, data=20, model=0): 20, ProcessCoord(pipe=0, data=21, model=0): 21, ProcessCoord(pipe=0, data=22, model=0): 22, ProcessCoord(pi 0: pe=0, data=23, model=0): 23, ProcessCoord(pipe=0, data=24, model=0): 24, ProcessCoord(pipe=0, data=25, model=0): 25, ProcessCoord(pipe=0, data=26, model=0): 26, ProcessCoord(pipe=0, data=27, model=0): 27, ProcessCoord(pipe=0, data=28, model=0): 28, ProcessCoord(pipe=0, data=29, model=0): 29, ProcessCoord(pipe=0, data=30, model=0): 30, ProcessCoord(pipe=0, data=31, model=0): 31, ProcessCoord(pipe=0, data=32, model=0): 32, ProcessCoord(pipe=0, data=33, model=0): 33, ProcessCoord(pipe=0, data=34, model=0): 34, ProcessCoord(pipe=0, data=35, model=0): 35, ProcessCoord(pipe=0, data=36, 
model=0): 36, ProcessCoord(pipe=0, data=37, model=0): 37, ProcessCoord(pipe=0, data=38, model=0): 38, ProcessCoord(pipe=0, data=39, model=0): 39, ProcessCoord(pipe=0, data=40, model=0): 40, ProcessCoord(pipe=0, data=41, model=0): 41, ProcessCoord(pipe=0, data=42, model=0): 42, ProcessCoord(pipe=0, data=43, model=0): 43, ProcessCoord(pipe=0, data=44, model=0): 44, ProcessCoord(pipe=0, data=45, model=0): 45, ProcessCoord(pipe=0, data=4 0: 6, model=0): 46, ProcessCoord(pipe=0, data=47, model=0): 47, ProcessCoord(pipe=0, data=48, model=0): 48, ProcessCoord(pipe=0, data=49, model=0): 49, ProcessCoord(pipe=0, data=50, model=0): 50, ProcessCoord(pipe=0, data=51, model=0): 51, ProcessCoord(pipe=0, data=52, model=0): 52, ProcessCoord(pipe=0, data=53, model=0): 53, ProcessCoord(pipe=0, data=54, model=0): 54, ProcessCoord(pipe=0, data=55, model=0): 55, ProcessCoord(pipe=0, data=56, model=0): 56, ProcessCoord(pipe=0, data=57, model=0): 57, ProcessCoord(pipe=0, data=58, model=0): 58, ProcessCoord(pipe=0, data=59, model=0): 59, ProcessCoord(pipe=0, data=60, model=0): 60, ProcessCoord(pipe=0, data=61, model=0): 61, ProcessCoord(pipe=0, data=62, model=0): 62, ProcessCoord(pipe=0, data=63, model=0): 63} 0: [2022-11-25 09:57:53,523] [INFO] [module.py:366:_partition_layers] Partitioning pipeline stages with method type:transformer 0: stage=0 layers=39 0: 0: _to_float16 0: 1: EmbeddingPipe 0: 2: 0: 3: ParallelTransformerLayerPipe 0: 4: ParallelTransformerLayerPipe 0: 5: ParallelTransformerLayerPipe 0: 6: ParallelTransformerLayerPipe 0: 7: ParallelTransformerLayerPipe 0: 8: ParallelTransformerLayerPipe 0: 9: ParallelTransformerLayerPipe 0: 10: ParallelTransformerLayerPipe 0: 11: ParallelTransformerLayerPipe 0: 12: ParallelTransformerLayerPipe 0: 13: ParallelTransformerLayerPipe 0: 14: ParallelTransformerLayerPipe 0: 15: ParallelTransformerLayerPipe 0: 16: ParallelTransformerLayerPipe 0: 17: ParallelTransformerLayerPipe 0: 18: ParallelTransformerLayerPipe 0: 19: 
ParallelTransformerLayerPipe 0: 20: ParallelTransformerLayerPipe 0: 21: ParallelTransformerLayerPipe 0: 22: ParallelTransformerLayerPipe 0: 23: ParallelTransformerLayerPipe 0: 24: ParallelTransformerLayerPipe 0: 25: ParallelTransformerLayerPipe 0: 26: ParallelTransformerLayerPipe 0: 27: ParallelTransformerLayerPipe 0: 28: ParallelTransformerLayerPipe 0: 29: ParallelTransformerLayerPipe 0: 30: ParallelTransformerLayerPipe 0: 31: ParallelTransformerLayerPipe 0: 32: ParallelTransformerLayerPipe 0: 33: ParallelTransformerLayerPipe 0: 34: ParallelTransformerLayerPipe 0: 35: undo 0: 36: MixedFusedLayerNorm 0: 37: EmbeddingPipe 0: 38: float16_to_fp32 0: loss: CrossEntropy 0: [2022-11-25 09:57:54,447] [INFO] [utils.py:827:see_memory_usage] After Building Model 0: [2022-11-25 09:57:54,447] [INFO] [utils.py:828:see_memory_usage] MA 4.03 GB Max_MA 4.03 GB CA 4.24 GB Max_CA 4 GB 0: [2022-11-25 09:57:54,448] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.05 GB, percent = 8.2% 0: setting training iterations to 44073 0: > learning rate decay style: cosine 0: DeepSpeed is enabled. 
0: [2022-11-25 09:57:54,450] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.5, git-hash=unknown, git-branch=unknown 0: [2022-11-25 09:58:08,322] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False 0: [2022-11-25 09:58:08,322] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer 0: [2022-11-25 09:58:08,322] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer 0: [2022-11-25 09:58:08,340] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam 0: [2022-11-25 09:58:08,340] [INFO] [logging.py:68:log_dist] [Rank 0] Creating BF16 optimizer 0: [2022-11-25 09:58:08,384] [INFO] [utils.py:827:see_memory_usage] begin bf16_optimizer 0: [2022-11-25 09:58:08,384] [INFO] [utils.py:828:see_memory_usage] MA 4.02 GB Max_MA 4.04 GB CA 4.26 GB Max_CA 4 GB 0: [2022-11-25 09:58:08,384] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.73 GB, percent = 8.3% 1: ninja: no work to do. 
1: Time to load utils op: 0.20072102546691895 seconds 1: Time to load utils op: 0.20334577560424805 seconds 1: Time to load utils op: 0.2033393383026123 seconds 1: Time to load utils op: 0.2029721736907959 seconds 1: Time to load utils op: 0.20308399200439453 seconds 1: Time to load utils op: 0.20376920700073242 seconds 1: Time to load utils op: 0.20344066619873047 seconds 1: Time to load utils op: 0.204132080078125 seconds 0: Time to load utils op: 0.2117459774017334 seconds 0: Time to load utils op: 0.21184611320495605 seconds 0: Time to load utils op: 0.21181344985961914 seconds 0: Time to load utils op: 0.21228837966918945 seconds 0: Time to load utils op: 0.2126781940460205 seconds 0: Time to load utils op: 0.21259713172912598 seconds 0: Time to load utils op: 0.21216177940368652 seconds 4: Time to load utils op: 0.2101750373840332 seconds 4: Time to load utils op: 0.21127104759216309 seconds 4: Time to load utils op: 0.21036171913146973 seconds 4: Time to load utils op: 0.20903563499450684 seconds 4: Time to load utils op: 0.2110910415649414 seconds 4: Time to load utils op: 0.21071386337280273 seconds 4: Time to load utils op: 0.21044349670410156 seconds 4: Time to load utils op: 0.21149158477783203 seconds 0: Time to load utils op: 0.20265793800354004 seconds 6: Time to load utils op: 0.21256065368652344 seconds 6: Time to load utils op: 0.21201634407043457 seconds 6: Time to load utils op: 0.21330881118774414 seconds 6: Time to load utils op: 0.21202468872070312 seconds 6: Time to load utils op: 0.2117595672607422 seconds 6: Time to load utils op: 0.21298527717590332 seconds 6: Time to load utils op: 0.21194911003112793 seconds 6: Time to load utils op: 0.21238422393798828 seconds 2: Time to load utils op: 0.21303391456604004 seconds 2: Time to load utils op: 0.21305561065673828 seconds 2: Time to load utils op: 0.21308231353759766 seconds 2: Time to load utils op: 0.21309638023376465 seconds 2: Time to load utils op: 0.21309566497802734 seconds
2: Time to load utils op: 0.21310067176818848 seconds 2: Time to load utils op: 0.21311306953430176 seconds 2: Time to load utils op: 0.21312689781188965 seconds 7: Time to load utils op: 0.21129322052001953 seconds 7: Time to load utils op: 0.21220064163208008 seconds 7: Time to load utils op: 0.21224284172058105 seconds 7: Time to load utils op: 0.210862398147583 seconds 7: Time to load utils op: 0.2121570110321045 seconds 7: Time to load utils op: 0.21292781829833984 seconds 7: Time to load utils op: 0.21208977699279785 seconds 7: Time to load utils op: 0.2115023136138916 seconds 3: Time to load utils op: 0.2123122215270996 seconds 3: Time to load utils op: 0.2123267650604248 seconds 3: Time to load utils op: 0.21233272552490234 seconds 3: Time to load utils op: 0.21235060691833496 seconds 3: Time to load utils op: 0.2123396396636963 seconds 3: Time to load utils op: 0.21234655380249023 seconds 3: Time to load utils op: 0.21234893798828125 seconds 3: Time to load utils op: 0.212371826171875 seconds 5: Time to load utils op: 0.21142363548278809 seconds 5: Time to load utils op: 0.21143531799316406 seconds 5: Time to load utils op: 0.21145200729370117 seconds 5: Time to load utils op: 0.2114582061767578 seconds 5: Time to load utils op: 0.2114567756652832 seconds 5: Time to load utils op: 0.2114732265472412 seconds 5: Time to load utils op: 0.21146059036254883 seconds 5: Time to load utils op: 0.21147990226745605 seconds 0: [2022-11-25 09:58:08,619] [INFO] [utils.py:827:see_memory_usage] before initializing group 0 0: [2022-11-25 09:58:08,620] [INFO] [utils.py:828:see_memory_usage] MA 4.02 GB Max_MA 4.02 GB CA 4.26 GB Max_CA 4 GB 0: [2022-11-25 09:58:08,620] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.73 GB, percent = 8.3% 0: [2022-11-25 09:58:09,571] [INFO] [utils.py:827:see_memory_usage] after initializing group 0 0: [2022-11-25 09:58:09,571] [INFO] [utils.py:828:see_memory_usage] MA 8.26 GB Max_MA 8.26 GB CA 10.51 GB Max_CA 11 GB 0: [2022-11-25
09:58:09,571] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.75 GB, percent = 8.3% 0: [2022-11-25 09:58:09,602] [INFO] [utils.py:827:see_memory_usage] before initializing group 1 0: [2022-11-25 09:58:09,602] [INFO] [utils.py:828:see_memory_usage] MA 8.26 GB Max_MA 8.26 GB CA 10.51 GB Max_CA 11 GB 0: [2022-11-25 09:58:09,602] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.74 GB, percent = 8.3% 0: Time to load utils op: 0.0006129741668701172 seconds 0: Time to load utils op: 0.0006835460662841797 seconds 0: Time to load utils op: 0.0006346702575683594 seconds 0: Time to load utils op: 0.0007202625274658203 seconds 0: Time to load utils op: 0.0007004737854003906 seconds 0: Time to load utils op: 0.0006895065307617188 seconds 0: Time to load utils op: 0.0006678104400634766 seconds 4: Time to load utils op: 0.0006201267242431641 seconds 4: Time to load utils op: 0.0006661415100097656 seconds 4: Time to load utils op: 0.0006952285766601562 seconds 4: Time to load utils op: 0.0006859302520751953 seconds 4: Time to load utils op: 0.0009059906005859375 seconds 4: Time to load utils op: 0.0008351802825927734 seconds 4: Time to load utils op: 0.0008649826049804688 seconds 4: Time to load utils op: 0.000904083251953125 seconds 1: Time to load utils op: 0.0004734992980957031 seconds 1: Time to load utils op: 0.0005106925964355469 seconds 1: Time to load utils op: 0.0005357265472412109 seconds 1: Time to load utils op: 0.00046825408935546875 seconds 1: Time to load utils op: 0.00045680999755859375 seconds 1: Time to load utils op: 0.000461578369140625 seconds 1: Time to load utils op: 0.0004715919494628906 seconds 1: Time to load utils op: 0.0004680156707763672 seconds 6: Time to load utils op: 0.0009751319885253906 seconds 6: Time to load utils op: 0.0011925697326660156 seconds 6: Time to load utils op: 0.0012650489807128906 seconds 6: Time to load utils op: 0.0012278556823730469 seconds 6: Time to load utils op: 0.001260995864868164 seconds
6: Time to load utils op: 0.0013518333435058594 seconds 6: Time to load utils op: 0.0013492107391357422 seconds 6: Time to load utils op: 0.0013535022735595703 seconds 2: Time to load utils op: 0.0011076927185058594 seconds 5: Time to load utils op: 0.0010919570922851562 seconds 5: Time to load utils op: 0.0011832714080810547 seconds 3: Time to load utils op: 0.001050710678100586 seconds 7: Time to load utils op: 0.000522613525390625 seconds 7: Time to load utils op: 0.0005061626434326172 seconds 5: Time to load utils op: 0.001374959945678711 seconds 7: Time to load utils op: 0.0005254745483398438 seconds 2: Time to load utils op: 0.0016627311706542969 seconds 5: Time to load utils op: 0.0013451576232910156 seconds 3: Time to load utils op: 0.0012769699096679688 seconds 5: Time to load utils op: 0.001318216323852539 seconds 3: Time to load utils op: 0.0013535022735595703 seconds 5: Time to load utils op: 0.001356363296508789 seconds 5: Time to load utils op: 0.0013594627380371094 seconds 5: Time to load utils op: 0.0013568401336669922 seconds 7: Time to load utils op: 0.00047588348388671875 seconds 7: Time to load utils op: 0.0004088878631591797 seconds 7: Time to load utils op: 0.00039768218994140625 seconds 7: Time to load utils op: 0.00040435791015625 seconds 2: Time to load utils op: 0.0016093254089355469 seconds 3: Time to load utils op: 0.0014576911926269531 seconds 7: Time to load utils op: 0.00039839744567871094 seconds 3: Time to load utils op: 0.0014791488647460938 seconds 3: Time to load utils op: 0.0014729499816894531 seconds 2: Time to load utils op: 0.001583099365234375 seconds 3: Time to load utils op: 0.0014095306396484375 seconds 2: Time to load utils op: 0.0016803741455078125 seconds 3: Time to load utils op: 0.0014452934265136719 seconds 2: Time to load utils op: 0.0016515254974365234 seconds 2: Time to load utils op: 0.0017061233520507812 seconds 2: Time to load utils op: 0.0016658306121826172 seconds 0: [2022-11-25 09:58:09,638] [INFO]
[utils.py:827:see_memory_usage] after initializing group 1 0: [2022-11-25 09:58:09,638] [INFO] [utils.py:828:see_memory_usage] MA 12.19 GB Max_MA 12.19 GB CA 16.33 GB Max_CA 16 GB 0: [2022-11-25 09:58:09,638] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.86 GB, percent = 8.3% 0: [2022-11-25 09:58:09,669] [INFO] [utils.py:827:see_memory_usage] before initializing group 2 0: [2022-11-25 09:58:09,670] [INFO] [utils.py:828:see_memory_usage] MA 12.19 GB Max_MA 12.19 GB CA 16.33 GB Max_CA 16 GB 0: [2022-11-25 09:58:09,670] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.89 GB, percent = 8.3% 0: [2022-11-25 09:58:09,705] [INFO] [utils.py:827:see_memory_usage] after initializing group 2 0: [2022-11-25 09:58:09,706] [INFO] [utils.py:828:see_memory_usage] MA 12.2 GB Max_MA 12.2 GB CA 16.33 GB Max_CA 16 GB 0: [2022-11-25 09:58:09,706] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.89 GB, percent = 8.3% 0: [2022-11-25 09:58:09,737] [INFO] [utils.py:827:see_memory_usage] before initialize_optimizer 0: [2022-11-25 09:58:09,737] [INFO] [utils.py:828:see_memory_usage] MA 12.2 GB Max_MA 12.2 GB CA 16.33 GB Max_CA 16 GB 0: [2022-11-25 09:58:09,737] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.89 GB, percent = 8.3% 0: [2022-11-25 09:58:09,774] [INFO] [utils.py:827:see_memory_usage] end initialize_optimizer 0: [2022-11-25 09:58:09,775] [INFO] [utils.py:828:see_memory_usage] MA 12.45 GB Max_MA 12.45 GB CA 16.52 GB Max_CA 17 GB 0: [2022-11-25 09:58:09,775] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.89 GB, percent = 8.3% 0: [2022-11-25 09:58:09,806] [INFO] [utils.py:827:see_memory_usage] end bf16_optimizer 0: [2022-11-25 09:58:09,807] [INFO] [utils.py:828:see_memory_usage] MA 12.45 GB Max_MA 12.45 GB CA 16.52 GB Max_CA 17 GB 0: [2022-11-25 09:58:09,807] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 41.89 GB, percent = 8.3% 0: [2022-11-25 09:58:09,807] 
[INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam 0: [2022-11-25 09:58:09,807] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler 0: [2022-11-25 09:58:09,807] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = 0: [2022-11-25 09:58:09,807] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1007:print] DeepSpeedEngine configuration: 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] activation_checkpointing_config { 0: "partition_activations": false, 0: "contiguous_memory_optimization": false, 0: "cpu_checkpointing": false, 0: "number_checkpoints": null, 0: "synchronize_checkpoint_boundary": false, 0: "profile": false 0: } 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] amp_enabled .................. False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] amp_params ................... False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] autotuning_config ............ 
{ 0: "enabled": false, 0: "start_step": null, 0: "end_step": null, 0: "metric_path": null, 0: "arg_mappings": null, 0: "metric": "throughput", 0: "model_info": null, 0: "results_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_results", 0: "exps_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_exps", 0: "overwrite": true, 0: "fast": true, 0: "start_profile_step": 3, 0: "end_profile_step": 5, 0: "tuner_type": "gridsearch", 0: "tuner_early_stopping": 5, 0: "tuner_num_trials": 50, 0: "model_info_path": null, 0: "mp_size": 1, 0: "max_train_batch_size": null, 0: "min_train_batch_size": 1, 0: "max_train_micro_batch_size_per_gpu": 1.024000e+03, 0: "min_train_micro_batch_size_per_gpu": 1, 0: "num_tuning_micro_batch_sizes": 3 0: } 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] bfloat16_enabled ............. True 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] checkpoint_parallel_write_pipeline False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] checkpoint_tag_validation_enabled True 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] checkpoint_tag_validation_fail False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] comms_config ................. 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] communication_data_type ...... None 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_pa 0: rameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] curriculum_enabled ........... False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] curriculum_params ............ False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] dataloader_drop_last ......... False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] disable_allgather ............ False 0: [2022-11-25 09:58:09,808] [INFO] [config.py:1011:print] dump_state ................... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] dynamic_loss_scale_args ...... None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_enabled ........... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_gas_boundary_resolution 1 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_layer_name ........ 
bert.encoder.layer 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_layer_num ......... 0 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_max_iter .......... 100 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_stability ......... 1e-06 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_tol ............... 0.01 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] eigenvalue_verbose ........... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] elasticity_enabled ........... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] flops_profiler_config ........ { 0: "enabled": false, 0: "profile_step": 1, 0: "module_depth": -1, 0: "top_modules": 1, 0: "detailed": true, 0: "output_file": null 0: } 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] fp16_auto_cast ............... None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] fp16_enabled ................. False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] fp16_master_weights_and_gradients False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] global_rank .................. 0 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] gradient_accumulation_steps .. 4 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] gradient_clipping ............ 1.0 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] gradient_predivide_factor .... 1.0 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] initial_dynamic_scale ........ 1 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] load_universal_checkpoint .... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] loss_scale ................... 1.0 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] memory_breakdown ............. False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] monitor_config ............... 
0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] nebula_config ................ { 0: "enabled": false, 0: "persistent_storage_path": null, 0: "persistent_time_interval": 100, 0: "num_of_version_in_retention": 2, 0: "enable_nebula_load": true, 0: "load_path": null 0: } 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] optimizer_legacy_fusion ...... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] optimizer_name ............... None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] optimizer_params ............. None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] pld_enabled .................. False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] pld_params ................... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] prescale_gradients ........... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] scheduler_name ............... None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] scheduler_params ............. None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] sparse_attention ............. None 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] sparse_gradients_enabled ..... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] steps_per_print .............. 2000 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] train_batch_size ............. 512 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] train_micro_batch_size_per_gpu 2 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] use_node_local_storage ....... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] wall_clock_breakdown ......... False 0: [2022-11-25 09:58:09,809] [INFO] [config.py:1011:print] world_size ................... 
64 0: [2022-11-25 09:58:09,810] [INFO] [config.py:1011:print] zero_allow_untested_optimizer False 0: [2022-11-25 09:58:09,810] [INFO] [config.py:1011:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False 0: [2022-11-25 09:58:09,810] [INFO] [config.py:1011:print] zero_enabled ................. False 0: [2022-11-25 09:58:09,810] [INFO] [config.py:1011:print] zero_optimization_stage ...... 
0 0: [2022-11-25 09:58:09,810] [INFO] [config.py:996:print_user_config] json = { 0: "train_micro_batch_size_per_gpu": 2, 0: "train_batch_size": 512, 0: "gradient_clipping": 1.0, 0: "zero_optimization": { 0: "stage": 0 0: }, 0: "bf16": { 0: "enabled": true 0: }, 0: "steps_per_print": 2.000000e+03, 0: "wall_clock_breakdown": false 0: } 0: Time to load utils op: 0.00039958953857421875 seconds 0: [2022-11-25 09:58:09,811] [INFO] [engine.py:87:__init__] CONFIG: micro_batches=4 micro_batch_size=2 0: [2022-11-25 09:58:09,835] [INFO] [engine.py:145:__init__] RANK=0 STAGE=0 LAYERS=39 [0, 39) STAGE_PARAMS=2160013824 (2160.014M) TOTAL_PARAMS=2160013824 (2160.014M) UNIQUE_PARAMS=2160013824 (2160.014M) 0: [2022-11-25 09:58:09,841] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_2b2/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: WARNING: could not find the metadata file checkpoints_2b2 0: will not load any checkpoints and will start from random 7: [2022-11-25 09:58:09,841] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_2b2/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 7: [2022-11-25 09:58:09,841] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_2b2/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 4: [2022-11-25 09:58:09,841] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_2b2/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
7: time (ms) | load-checkpoint: 8.00 0: estimated model parameters: 2.160013824 0: estimated model parameters without embeddings: 2.039394816 0: [after model, optimizer, and learning rate scheduler are built] datetime: 2022-11-25 09:58:10 0: > building train, validation, and test datasets ... 0: > datasets target sizes (minimum size): 0: train: 22565693 0: validation: 23040 0: test: 512 0: > building train, validation, and test datasets for GPT ... 0: > building dataset index ... 0: reading sizes... 0: reading pointers... 0: reading document index... 0: creating numpy buffer of mmap... 0: creating memory view of numpy buffer...
0: > finished creating indexed dataset in 0.008434 seconds 0: number of documents: 210604984 0: > dataset split: 0: train: 0: document indices in [0, 199864130) total of 199864130 documents 0: validation: 0: document indices in [199864130, 210394379) total of 10530249 documents 0: test: 0: document indices in [210394379, 210604984) total of 210605 documents 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_22565693ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_22565693ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_22565693ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.042 seconds 0: total number of samples: 173377817 0: total number of epochs: 1 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_23040ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_23040ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_23040ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.098 seconds 0: total number of samples: 9118345 0: total number of epochs: 1 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_512ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_512ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from 
/scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_512ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.071 seconds 0: total number of samples: 182928 0: total number of epochs: 1 0: > finished creating GPT datasets ... 0: [after dataloaders are built] datetime: 2022-11-25 09:58:28 0: done with setup ... 0: training ... 0: Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings: 7: time (ms) | model-and-optimizer-setup: 18834.32 | train/valid/test-data-iterators-setup: 17970.88 0: [000-000] 2.1600B / 2.0394B 0: [before the start of training step] datetime: 2022-11-25 09:58:28 0: [Rank 0] (after 10 iterations) memory (MB) | allocated: 18448.37548828125 | max allocated: 52513.4677734375 | reserved: 59052.0 | max reserved: 59052.0 7: iteration 10/ 44073 | consumed samples: 5120 | consumed tokens: 10485760 | elapsed time per iteration (s): 6.04 | learning rate: 4.538E-06 | global batch size: 512 | lm loss: 1.031786E+01 | grad norm: 21.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 84.718 | TFLOPs: 39.48 | 7: iteration 20/ 44073 | consumed samples: 10240 | consumed tokens: 20971520 | elapsed time per iteration (s): 4.18 | learning rate: 9.076E-06 | global batch size: 512 | lm loss: 8.693311E+00 | grad norm: 8.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.601 | TFLOPs: 57.14 | 7: iteration 30/ 44073 | consumed samples: 15360 | consumed tokens: 31457280 | elapsed time per iteration (s): 4.14 | learning rate: 1.361E-05 | global batch size: 512 | lm loss: 8.206876E+00 | grad norm: 5.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.798 | TFLOPs: 57.70 | 7: iteration 40/ 44073 | consumed samples: 20480 | consumed tokens: 41943040 | elapsed time per iteration (s): 4.15 | learning rate: 1.815E-05 | global batch size: 512 | lm 
loss: 7.840375E+00 | grad norm: 3.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.472 | TFLOPs: 57.54 | 7: iteration 50/ 44073 | consumed samples: 25600 | consumed tokens: 52428800 | elapsed time per iteration (s): 4.15 | learning rate: 2.269E-05 | global batch size: 512 | lm loss: 7.448897E+00 | grad norm: 2.464 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.264 | TFLOPs: 57.45 | 7: iteration 60/ 44073 | consumed samples: 30720 | consumed tokens: 62914560 | elapsed time per iteration (s): 4.15 | learning rate: 2.723E-05 | global batch size: 512 | lm loss: 7.088107E+00 | grad norm: 1.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.455 | TFLOPs: 57.54 | 7: iteration 70/ 44073 | consumed samples: 35840 | consumed tokens: 73400320 | elapsed time per iteration (s): 4.17 | learning rate: 3.177E-05 | global batch size: 512 | lm loss: 6.849487E+00 | grad norm: 2.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.866 | TFLOPs: 57.26 | 7: iteration 80/ 44073 | consumed samples: 40960 | consumed tokens: 83886080 | elapsed time per iteration (s): 4.14 | learning rate: 3.630E-05 | global batch size: 512 | lm loss: 6.655862E+00 | grad norm: 2.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.720 | TFLOPs: 57.66 | 7: iteration 90/ 44073 | consumed samples: 46080 | consumed tokens: 94371840 | elapsed time per iteration (s): 4.16 | learning rate: 4.084E-05 | global batch size: 512 | lm loss: 6.516167E+00 | grad norm: 3.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.983 | TFLOPs: 57.32 | 7: iteration 100/ 44073 | consumed samples: 51200 | consumed tokens: 104857600 | elapsed time per iteration (s): 4.14 | learning 
rate: 4.538E-05 | global batch size: 512 | lm loss: 6.367654E+00 | grad norm: 2.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.592 | TFLOPs: 57.60 | 7: iteration 110/ 44073 | consumed samples: 56320 | consumed tokens: 115343360 | elapsed time per iteration (s): 4.17 | learning rate: 4.992E-05 | global batch size: 512 | lm loss: 6.223198E+00 | grad norm: 2.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.791 | TFLOPs: 57.23 | 7: iteration 120/ 44073 | consumed samples: 61440 | consumed tokens: 125829120 | elapsed time per iteration (s): 4.18 | learning rate: 5.445E-05 | global batch size: 512 | lm loss: 6.138682E+00 | grad norm: 2.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.531 | TFLOPs: 57.11 | 7: iteration 130/ 44073 | consumed samples: 66560 | consumed tokens: 136314880 | elapsed time per iteration (s): 4.16 | learning rate: 5.899E-05 | global batch size: 512 | lm loss: 6.034589E+00 | grad norm: 2.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.203 | TFLOPs: 57.42 | 7: iteration 140/ 44073 | consumed samples: 71680 | consumed tokens: 146800640 | elapsed time per iteration (s): 4.17 | learning rate: 6.353E-05 | global batch size: 512 | lm loss: 5.951537E+00 | grad norm: 3.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.825 | TFLOPs: 57.24 | 7: iteration 150/ 44073 | consumed samples: 76800 | consumed tokens: 157286400 | elapsed time per iteration (s): 4.15 | learning rate: 6.807E-05 | global batch size: 512 | lm loss: 5.848100E+00 | grad norm: 2.093 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.478 | TFLOPs: 57.55 | 7: iteration 160/ 44073 | consumed samples: 81920 | consumed tokens: 
167772160 | elapsed time per iteration (s): 4.17 | learning rate: 7.261E-05 | global batch size: 512 | lm loss: 5.820253E+00 | grad norm: 1.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.845 | TFLOPs: 57.25 | 7: iteration 170/ 44073 | consumed samples: 87040 | consumed tokens: 178257920 | elapsed time per iteration (s): 4.15 | learning rate: 7.714E-05 | global batch size: 512 | lm loss: 5.755837E+00 | grad norm: 1.579 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.266 | TFLOPs: 57.45 | 7: iteration 180/ 44073 | consumed samples: 92160 | consumed tokens: 188743680 | elapsed time per iteration (s): 4.17 | learning rate: 8.168E-05 | global batch size: 512 | lm loss: 5.684624E+00 | grad norm: 1.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.899 | TFLOPs: 57.28 | 7: iteration 190/ 44073 | consumed samples: 97280 | consumed tokens: 199229440 | elapsed time per iteration (s): 4.17 | learning rate: 8.622E-05 | global batch size: 512 | lm loss: 5.588400E+00 | grad norm: 1.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.672 | TFLOPs: 57.17 | 7: iteration 200/ 44073 | consumed samples: 102400 | consumed tokens: 209715200 | elapsed time per iteration (s): 4.20 | learning rate: 9.076E-05 | global batch size: 512 | lm loss: 5.564513E+00 | grad norm: 2.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.819 | TFLOPs: 56.77 | 7: iteration 210/ 44073 | consumed samples: 107520 | consumed tokens: 220200960 | elapsed time per iteration (s): 4.22 | learning rate: 9.530E-05 | global batch size: 512 | lm loss: 5.530732E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.222 | TFLOPs: 56.50 | 7: iteration 
220/ 44073 | consumed samples: 112640 | consumed tokens: 230686720 | elapsed time per iteration (s): 4.17 | learning rate: 9.983E-05 | global batch size: 512 | lm loss: 5.467340E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.829 | TFLOPs: 57.24 | 7: iteration 230/ 44073 | consumed samples: 117760 | consumed tokens: 241172480 | elapsed time per iteration (s): 4.16 | learning rate: 1.044E-04 | global batch size: 512 | lm loss: 5.414989E+00 | grad norm: 1.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.968 | TFLOPs: 57.31 | 7: iteration 240/ 44073 | consumed samples: 122880 | consumed tokens: 251658240 | elapsed time per iteration (s): 4.19 | learning rate: 1.089E-04 | global batch size: 512 | lm loss: 5.367589E+00 | grad norm: 1.334 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.329 | TFLOPs: 57.01 | 7: iteration 250/ 44073 | consumed samples: 128000 | consumed tokens: 262144000 | elapsed time per iteration (s): 4.17 | learning rate: 1.134E-04 | global batch size: 512 | lm loss: 5.305630E+00 | grad norm: 1.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.881 | TFLOPs: 57.27 | 7: iteration 260/ 44073 | consumed samples: 133120 | consumed tokens: 272629760 | elapsed time per iteration (s): 4.16 | learning rate: 1.180E-04 | global batch size: 512 | lm loss: 5.271925E+00 | grad norm: 1.101 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.967 | TFLOPs: 57.31 | 7: iteration 270/ 44073 | consumed samples: 138240 | consumed tokens: 283115520 | elapsed time per iteration (s): 4.17 | learning rate: 1.225E-04 | global batch size: 512 | lm loss: 5.235262E+00 | grad norm: 1.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 122.822 | TFLOPs: 57.24 | 7: iteration 280/ 44073 | consumed samples: 143360 | consumed tokens: 293601280 | elapsed time per iteration (s): 4.15 | learning rate: 1.271E-04 | global batch size: 512 | lm loss: 5.242725E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.319 | TFLOPs: 57.47 | 7: iteration 290/ 44073 | consumed samples: 148480 | consumed tokens: 304087040 | elapsed time per iteration (s): 4.15 | learning rate: 1.316E-04 | global batch size: 512 | lm loss: 5.169073E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.521 | TFLOPs: 57.57 | 7: iteration 300/ 44073 | consumed samples: 153600 | consumed tokens: 314572800 | elapsed time per iteration (s): 4.15 | learning rate: 1.361E-04 | global batch size: 512 | lm loss: 5.102835E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.498 | TFLOPs: 57.56 | 7: iteration 310/ 44073 | consumed samples: 158720 | consumed tokens: 325058560 | elapsed time per iteration (s): 4.15 | learning rate: 1.407E-04 | global batch size: 512 | lm loss: 5.055267E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.315 | TFLOPs: 57.47 | 7: iteration 320/ 44073 | consumed samples: 163840 | consumed tokens: 335544320 | elapsed time per iteration (s): 4.19 | learning rate: 1.452E-04 | global batch size: 512 | lm loss: 5.080972E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.281 | TFLOPs: 56.99 | 7: iteration 330/ 44073 | consumed samples: 168960 | consumed tokens: 346030080 | elapsed time per iteration (s): 4.14 | learning rate: 1.497E-04 | global batch size: 512 | lm loss: 5.018291E+00 | grad norm: 1.144 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 | 7: iteration 340/ 44073 | consumed samples: 174080 | consumed tokens: 356515840 | elapsed time per iteration (s): 4.15 | learning rate: 1.543E-04 | global batch size: 512 | lm loss: 4.940012E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.329 | TFLOPs: 57.48 | 7: iteration 350/ 44073 | consumed samples: 179200 | consumed tokens: 367001600 | elapsed time per iteration (s): 4.16 | learning rate: 1.588E-04 | global batch size: 512 | lm loss: 4.908014E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.074 | TFLOPs: 57.36 | 7: iteration 360/ 44073 | consumed samples: 184320 | consumed tokens: 377487360 | elapsed time per iteration (s): 4.14 | learning rate: 1.634E-04 | global batch size: 512 | lm loss: 4.930172E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.639 | TFLOPs: 57.62 | 7: iteration 370/ 44073 | consumed samples: 189440 | consumed tokens: 387973120 | elapsed time per iteration (s): 4.16 | learning rate: 1.679E-04 | global batch size: 512 | lm loss: 4.886673E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.007 | TFLOPs: 57.33 | 7: iteration 380/ 44073 | consumed samples: 194560 | consumed tokens: 398458880 | elapsed time per iteration (s): 4.30 | learning rate: 1.724E-04 | global batch size: 512 | lm loss: 4.803664E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.934 | TFLOPs: 55.43 | 7: iteration 390/ 44073 | consumed samples: 199680 | consumed tokens: 408944640 | elapsed time per iteration (s): 4.15 | learning rate: 1.770E-04 | global batch size: 512 | 
lm loss: 4.803553E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.349 | TFLOPs: 57.49 | 7: iteration 400/ 44073 | consumed samples: 204800 | consumed tokens: 419430400 | elapsed time per iteration (s): 4.14 | learning rate: 1.815E-04 | global batch size: 512 | lm loss: 4.708539E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.670 | TFLOPs: 57.64 | 7: iteration 410/ 44073 | consumed samples: 209920 | consumed tokens: 429916160 | elapsed time per iteration (s): 4.26 | learning rate: 1.861E-04 | global batch size: 512 | lm loss: 4.700165E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.110 | TFLOPs: 55.98 | 7: iteration 420/ 44073 | consumed samples: 215040 | consumed tokens: 440401920 | elapsed time per iteration (s): 4.15 | learning rate: 1.906E-04 | global batch size: 512 | lm loss: 4.663181E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.505 | TFLOPs: 57.56 | 7: iteration 430/ 44073 | consumed samples: 220160 | consumed tokens: 450887680 | elapsed time per iteration (s): 4.15 | learning rate: 1.951E-04 | global batch size: 512 | lm loss: 4.603146E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.328 | TFLOPs: 57.48 | 7: iteration 440/ 44073 | consumed samples: 225280 | consumed tokens: 461373440 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 4.622343E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.569 | TFLOPs: 57.59 | 7: iteration 450/ 44073 | consumed samples: 230400 | consumed tokens: 471859200 | elapsed time per iteration (s): 
4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.582607E+00 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.159 | TFLOPs: 57.40 | 7: iteration 460/ 44073 | consumed samples: 235520 | consumed tokens: 482344960 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.559572E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.577 | TFLOPs: 57.59 | 7: iteration 470/ 44073 | consumed samples: 240640 | consumed tokens: 492830720 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.427233E+00 | grad norm: 0.608 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.318 | TFLOPs: 57.47 | 7: iteration 480/ 44073 | consumed samples: 245760 | consumed tokens: 503316480 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.425594E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.738 | TFLOPs: 57.67 | 7: iteration 490/ 44073 | consumed samples: 250880 | consumed tokens: 513802240 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.347851E+00 | grad norm: 0.586 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.748 | TFLOPs: 57.67 | 7: iteration 500/ 44073 | consumed samples: 256000 | consumed tokens: 524288000 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.300743E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.205 | TFLOPs: 57.42 | 7: iteration 510/ 44073 | consumed samples: 261120 | 
consumed tokens: 534773760 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.270889E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.762 | TFLOPs: 57.68 | 7: iteration 520/ 44073 | consumed samples: 266240 | consumed tokens: 545259520 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.238422E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.739 | TFLOPs: 57.67 | 7: iteration 530/ 44073 | consumed samples: 271360 | consumed tokens: 555745280 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.159966E+00 | grad norm: 0.888 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.722 | TFLOPs: 57.66 | 7: iteration 540/ 44073 | consumed samples: 276480 | consumed tokens: 566231040 | elapsed time per iteration (s): 4.17 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.100504E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.752 | TFLOPs: 57.21 | 7: iteration 550/ 44073 | consumed samples: 281600 | consumed tokens: 576716800 | elapsed time per iteration (s): 4.17 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 4.055560E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.899 | TFLOPs: 57.28 | 7: iteration 560/ 44073 | consumed samples: 286720 | consumed tokens: 587202560 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.953367E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.391 | TFLOPs: 
57.51 | 7: iteration 570/ 44073 | consumed samples: 291840 | consumed tokens: 597688320 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.864960E+00 | grad norm: 0.589 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.719 | TFLOPs: 57.66 | 7: iteration 580/ 44073 | consumed samples: 296960 | consumed tokens: 608174080 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.812043E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 590/ 44073 | consumed samples: 302080 | consumed tokens: 618659840 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.752113E+00 | grad norm: 0.550 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.576 | TFLOPs: 57.59 | 7: iteration 600/ 44073 | consumed samples: 307200 | consumed tokens: 629145600 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.704802E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.211 | TFLOPs: 57.42 | 7: iteration 610/ 44073 | consumed samples: 312320 | consumed tokens: 639631360 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.619638E+00 | grad norm: 0.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.358 | TFLOPs: 57.49 | 7: iteration 620/ 44073 | consumed samples: 317440 | consumed tokens: 650117120 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.597163E+00 | grad norm: 0.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.557 | TFLOPs: 57.58 | 7: iteration 630/ 44073 | consumed samples: 322560 | consumed tokens: 660602880 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.513525E+00 | grad norm: 0.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.958 | TFLOPs: 57.30 | 7: iteration 640/ 44073 | consumed samples: 327680 | consumed tokens: 671088640 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.537682E+00 | grad norm: 0.540 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.181 | TFLOPs: 57.41 | 7: iteration 650/ 44073 | consumed samples: 332800 | consumed tokens: 681574400 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.440879E+00 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.596 | TFLOPs: 57.60 | 7: iteration 660/ 44073 | consumed samples: 337920 | consumed tokens: 692060160 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.460236E+00 | grad norm: 0.557 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 670/ 44073 | consumed samples: 343040 | consumed tokens: 702545920 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.444968E+00 | grad norm: 0.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.452 | TFLOPs: 57.54 | 7: iteration 680/ 44073 | consumed samples: 348160 | consumed tokens: 713031680 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.385626E+00 | grad norm: 0.369 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.742 | TFLOPs: 57.67 | 7: iteration 690/ 44073 | consumed samples: 353280 | consumed tokens: 723517440 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.399790E+00 | grad norm: 0.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.772 | TFLOPs: 57.68 | 7: iteration 700/ 44073 | consumed samples: 358400 | consumed tokens: 734003200 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.385705E+00 | grad norm: 0.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.477 | TFLOPs: 57.55 | 7: iteration 710/ 44073 | consumed samples: 363520 | consumed tokens: 744488960 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.341087E+00 | grad norm: 0.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 720/ 44073 | consumed samples: 368640 | consumed tokens: 754974720 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.327607E+00 | grad norm: 0.509 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.729 | TFLOPs: 57.66 | 7: iteration 730/ 44073 | consumed samples: 373760 | consumed tokens: 765460480 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.329799E+00 | grad norm: 0.431 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.626 | TFLOPs: 57.62 | 7: iteration 740/ 44073 | consumed samples: 378880 | consumed tokens: 775946240 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global 
batch size: 512 | lm loss: 3.287175E+00 | grad norm: 0.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.738 | TFLOPs: 57.67 | 7: iteration 750/ 44073 | consumed samples: 384000 | consumed tokens: 786432000 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.285693E+00 | grad norm: 0.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 | 7: iteration 760/ 44073 | consumed samples: 389120 | consumed tokens: 796917760 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.276583E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.401 | TFLOPs: 57.51 | 7: iteration 770/ 44073 | consumed samples: 394240 | consumed tokens: 807403520 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.370159E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.539 | TFLOPs: 57.58 | 7: iteration 780/ 44073 | consumed samples: 399360 | consumed tokens: 817889280 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.329454E+00 | grad norm: 0.568 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.302 | TFLOPs: 57.47 | 7: iteration 790/ 44073 | consumed samples: 404480 | consumed tokens: 828375040 | elapsed time per iteration (s): 4.17 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.279073E+00 | grad norm: 0.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.909 | TFLOPs: 57.28 | 7: iteration 800/ 44073 | consumed samples: 409600 | consumed tokens: 838860800 | elapsed time 
per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.242126E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 | 7: iteration 810/ 44073 | consumed samples: 414720 | consumed tokens: 849346560 | elapsed time per iteration (s): 4.18 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.173577E+00 | grad norm: 0.273 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.607 | TFLOPs: 57.14 | 7: iteration 820/ 44073 | consumed samples: 419840 | consumed tokens: 859832320 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.160255E+00 | grad norm: 0.325 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.544 | TFLOPs: 57.58 | 7: iteration 830/ 44073 | consumed samples: 424960 | consumed tokens: 870318080 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.160024E+00 | grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.697 | TFLOPs: 57.65 | 7: iteration 840/ 44073 | consumed samples: 430080 | consumed tokens: 880803840 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.137240E+00 | grad norm: 0.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.497 | TFLOPs: 57.56 | 7: iteration 850/ 44073 | consumed samples: 435200 | consumed tokens: 891289600 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.172816E+00 | grad norm: 0.366 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.740 | TFLOPs: 57.67 | 7: iteration 860/ 44073 | consumed 
samples: 440320 | consumed tokens: 901775360 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.122580E+00 | grad norm: 0.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.025 | TFLOPs: 57.34 | 7: iteration 870/ 44073 | consumed samples: 445440 | consumed tokens: 912261120 | elapsed time per iteration (s): 4.15 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.123966E+00 | grad norm: 0.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.262 | TFLOPs: 57.45 | 7: iteration 880/ 44073 | consumed samples: 450560 | consumed tokens: 922746880 | elapsed time per iteration (s): 4.14 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.116187E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.684 | TFLOPs: 57.64 | 7: iteration 890/ 44073 | consumed samples: 455680 | consumed tokens: 933232640 | elapsed time per iteration (s): 4.19 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.128916E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.253 | TFLOPs: 56.98 | 7: iteration 900/ 44073 | consumed samples: 460800 | consumed tokens: 943718400 | elapsed time per iteration (s): 4.16 | learning rate: 2.000E-04 | global batch size: 512 | lm loss: 3.284354E+00 | grad norm: 2.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.069 | TFLOPs: 57.36 | 7: iteration 910/ 44073 | consumed samples: 465920 | consumed tokens: 954204160 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.324784E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
123.547 | TFLOPs: 57.58 | 7: iteration 920/ 44073 | consumed samples: 471040 | consumed tokens: 964689920 | elapsed time per iteration (s): 4.16 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.283570E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 | 7: iteration 930/ 44073 | consumed samples: 476160 | consumed tokens: 975175680 | elapsed time per iteration (s): 4.39 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.252379E+00 | grad norm: 0.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.500 | TFLOPs: 54.30 | 7: iteration 940/ 44073 | consumed samples: 481280 | consumed tokens: 985661440 | elapsed time per iteration (s): 4.22 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.137615E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.317 | TFLOPs: 56.54 | 7: iteration 950/ 44073 | consumed samples: 486400 | consumed tokens: 996147200 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.095798E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.640 | TFLOPs: 57.62 | 7: iteration 960/ 44073 | consumed samples: 491520 | consumed tokens: 1006632960 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.073556E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.764 | TFLOPs: 57.68 | 7: iteration 970/ 44073 | consumed samples: 496640 | consumed tokens: 1017118720 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.045887E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.529 | TFLOPs: 57.57 |
7: iteration 980/ 44073 | consumed samples: 501760 | consumed tokens: 1027604480 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.037700E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.516 | TFLOPs: 57.56 |
7: iteration 990/ 44073 | consumed samples: 506880 | consumed tokens: 1038090240 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.042038E+00 | grad norm: 0.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.760 | TFLOPs: 57.68 |
7: iteration 1000/ 44073 | consumed samples: 512000 | consumed tokens: 1048576000 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.046179E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.596 | TFLOPs: 57.60 |
7: ------------------------------------------------------------------------------------------
7: valid loss at iteration 1000 | lm loss value: 2.946902E+00 | lm loss PPL: 1.904686E+01 |
7: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 1000 to checkpoints_2b2
0: [2022-11-25 11:08:08,828] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step1000 is begin to save!
0: [2022-11-25 11:08:09,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_01-model_00-model_states.pt...
0: [2022-11-25 11:08:10,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_01-model_00-model_states.pt.
0: [2022-11-25 11:08:10,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_03-model_00-model_states.pt... 0: [2022-11-25 11:08:10,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_03-model_00-model_states.pt. 0: [2022-11-25 11:08:10,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_04-model_00-model_states.pt... 0: [2022-11-25 11:08:10,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_04-model_00-model_states.pt. 0: [2022-11-25 11:08:10,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_05-model_00-model_states.pt... 0: [2022-11-25 11:08:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_05-model_00-model_states.pt. 0: [2022-11-25 11:08:10,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_06-model_00-model_states.pt... 0: [2022-11-25 11:08:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_06-model_00-model_states.pt. 0: [2022-11-25 11:08:10,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_07-model_00-model_states.pt... 0: [2022-11-25 11:08:10,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_07-model_00-model_states.pt. 0: [2022-11-25 11:08:10,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_08-model_00-model_states.pt... 0: [2022-11-25 11:08:11,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 11:08:11,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_09-model_00-model_states.pt... 0: [2022-11-25 11:08:11,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_09-model_00-model_states.pt. 0: [2022-11-25 11:08:11,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_10-model_00-model_states.pt... 0: [2022-11-25 11:08:11,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_10-model_00-model_states.pt. 0: [2022-11-25 11:08:11,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_11-model_00-model_states.pt... 0: [2022-11-25 11:08:11,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_11-model_00-model_states.pt. 0: [2022-11-25 11:08:11,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_12-model_00-model_states.pt... 0: [2022-11-25 11:08:11,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_12-model_00-model_states.pt. 0: [2022-11-25 11:08:11,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_13-model_00-model_states.pt... 0: [2022-11-25 11:08:11,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_13-model_00-model_states.pt. 0: [2022-11-25 11:08:11,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_14-model_00-model_states.pt... 0: [2022-11-25 11:08:11,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 11:08:11,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_15-model_00-model_states.pt... 0: [2022-11-25 11:08:11,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_15-model_00-model_states.pt. 0: [2022-11-25 11:08:11,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_16-model_00-model_states.pt... 0: [2022-11-25 11:08:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_16-model_00-model_states.pt. 0: [2022-11-25 11:08:12,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_17-model_00-model_states.pt... 0: [2022-11-25 11:08:12,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_17-model_00-model_states.pt. 0: [2022-11-25 11:08:12,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_18-model_00-model_states.pt... 0: [2022-11-25 11:08:12,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_18-model_00-model_states.pt. 0: [2022-11-25 11:08:12,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_19-model_00-model_states.pt... 0: [2022-11-25 11:08:12,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_19-model_00-model_states.pt. 0: [2022-11-25 11:08:12,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_20-model_00-model_states.pt... 0: [2022-11-25 11:08:12,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 11:08:12,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_21-model_00-model_states.pt... 0: [2022-11-25 11:08:12,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_21-model_00-model_states.pt. 0: [2022-11-25 11:08:12,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_22-model_00-model_states.pt... 0: [2022-11-25 11:08:12,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_22-model_00-model_states.pt. 0: [2022-11-25 11:08:12,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_23-model_00-model_states.pt... 0: [2022-11-25 11:08:13,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_23-model_00-model_states.pt. 0: [2022-11-25 11:08:13,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_24-model_00-model_states.pt... 0: [2022-11-25 11:08:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_24-model_00-model_states.pt. 0: [2022-11-25 11:08:13,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_25-model_00-model_states.pt... 0: [2022-11-25 11:08:13,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_25-model_00-model_states.pt. 0: [2022-11-25 11:08:13,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_26-model_00-model_states.pt... 0: [2022-11-25 11:08:13,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 11:08:13,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_27-model_00-model_states.pt... 0: [2022-11-25 11:08:13,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_27-model_00-model_states.pt. 0: [2022-11-25 11:08:13,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_28-model_00-model_states.pt... 0: [2022-11-25 11:08:13,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_28-model_00-model_states.pt. 0: [2022-11-25 11:08:13,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_29-model_00-model_states.pt... 0: [2022-11-25 11:08:13,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_29-model_00-model_states.pt. 0: [2022-11-25 11:08:13,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_30-model_00-model_states.pt... 0: [2022-11-25 11:08:13,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_30-model_00-model_states.pt. 0: [2022-11-25 11:08:13,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_31-model_00-model_states.pt... 0: [2022-11-25 11:08:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_31-model_00-model_states.pt. 0: [2022-11-25 11:08:14,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_32-model_00-model_states.pt... 0: [2022-11-25 11:08:14,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 11:08:14,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_33-model_00-model_states.pt... 0: [2022-11-25 11:08:14,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_33-model_00-model_states.pt. 0: [2022-11-25 11:08:14,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_34-model_00-model_states.pt... 0: [2022-11-25 11:08:14,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_34-model_00-model_states.pt. 0: [2022-11-25 11:08:14,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/layer_36-model_00-model_states.pt... 0: [2022-11-25 11:08:14,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/layer_36-model_00-model_states.pt. 0: [2022-11-25 11:08:14,502] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step1000/mp_rank_00_model_states.pt 0: [2022-11-25 11:08:14,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/mp_rank_00_model_states.pt... 0: [2022-11-25 11:08:14,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/mp_rank_00_model_states.pt. 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
5: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 1: [2022-11-25 11:08:14,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 0: [2022-11-25 11:08:15,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-25 11:08:15,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 11:08:15,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 11:08:15,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-25 11:08:15,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 11:08:15,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 11:08:15,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 11:08:15,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 11:08:15,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 11:08:15,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 11:08:15,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 11:08:15,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 11:08:15,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 11:08:15,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 11:08:15,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 11:08:15,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
3: [2022-11-25 11:08:15,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 11:08:15,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 11:08:15,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 11:08:15,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 11:08:15,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 11:08:15,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 11:08:15,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 11:08:15,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 11:08:15,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 11:08:15,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 11:08:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 11:08:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 11:08:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
0: successfully saved checkpoint at iteration 1000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 7002.08 7: iteration 1010/ 44073 | consumed samples: 517120 | consumed tokens: 1059061760 | elapsed time per iteration (s): 5.18 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.048956E+00 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 98.891 | TFLOPs: 46.09 | 7: iteration 1020/ 44073 | consumed samples: 522240 | consumed tokens: 1069547520 | elapsed time per iteration (s): 4.16 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.982562E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.098 | TFLOPs: 57.37 | 7: iteration 1030/ 44073 | consumed samples: 527360 | consumed tokens: 1080033280 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.006166E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.595 | TFLOPs: 57.60 | 7: iteration 1040/ 44073 | consumed samples: 532480 | consumed tokens: 1090519040 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.977444E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.712 | TFLOPs: 57.66 | 7: iteration 1050/ 44073 | consumed samples: 537600 | consumed tokens: 1101004800 | elapsed time per iteration (s): 4.17 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.968232E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.751 | TFLOPs: 57.21 | 7: iteration 1060/ 44073 | consumed samples: 542720 | consumed tokens: 1111490560 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch 
size: 512 | lm loss: 2.952261E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 | 7: iteration 1070/ 44073 | consumed samples: 547840 | consumed tokens: 1121976320 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.956355E+00 | grad norm: 0.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.385 | TFLOPs: 57.50 | 7: iteration 1080/ 44073 | consumed samples: 552960 | consumed tokens: 1132462080 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.939719E+00 | grad norm: 0.346 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.706 | TFLOPs: 57.65 | 7: iteration 1090/ 44073 | consumed samples: 558080 | consumed tokens: 1142947840 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.961987E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.609 | TFLOPs: 57.61 | 7: iteration 1100/ 44073 | consumed samples: 563200 | consumed tokens: 1153433600 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.952326E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.727 | TFLOPs: 57.66 | 7: iteration 1110/ 44073 | consumed samples: 568320 | consumed tokens: 1163919360 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.944314E+00 | grad norm: 0.329 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.693 | TFLOPs: 57.65 | 7: iteration 1120/ 44073 | consumed samples: 573440 | consumed tokens: 1174405120 | elapsed 
time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.942591E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.739 | TFLOPs: 57.67 | 7: iteration 1130/ 44073 | consumed samples: 578560 | consumed tokens: 1184890880 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.893078E+00 | grad norm: 0.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.256 | TFLOPs: 57.44 | 7: iteration 1140/ 44073 | consumed samples: 583680 | consumed tokens: 1195376640 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.889978E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.735 | TFLOPs: 57.67 | 7: iteration 1150/ 44073 | consumed samples: 588800 | consumed tokens: 1205862400 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.916191E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.626 | TFLOPs: 57.62 | 7: iteration 1160/ 44073 | consumed samples: 593920 | consumed tokens: 1216348160 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.921078E+00 | grad norm: 0.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.428 | TFLOPs: 57.52 | 7: iteration 1170/ 44073 | consumed samples: 599040 | consumed tokens: 1226833920 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.851636E+00 | grad norm: 0.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.348 | TFLOPs: 57.49 | 7: iteration 1180/ 
44073 | consumed samples: 604160 | consumed tokens: 1237319680 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.891707E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.609 | TFLOPs: 57.61 | 7: iteration 1190/ 44073 | consumed samples: 609280 | consumed tokens: 1247805440 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.897008E+00 | grad norm: 0.331 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.354 | TFLOPs: 57.49 | 7: iteration 1200/ 44073 | consumed samples: 614400 | consumed tokens: 1258291200 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.892445E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.669 | TFLOPs: 57.64 | 7: iteration 1210/ 44073 | consumed samples: 619520 | consumed tokens: 1268776960 | elapsed time per iteration (s): 4.16 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.872112E+00 | grad norm: 0.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.935 | TFLOPs: 57.29 | 7: iteration 1220/ 44073 | consumed samples: 624640 | consumed tokens: 1279262720 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.856385E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.609 | TFLOPs: 57.61 | 7: iteration 1230/ 44073 | consumed samples: 629760 | consumed tokens: 1289748480 | elapsed time per iteration (s): 4.14 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 2.947136E+00 | grad norm: 1.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.561 | TFLOPs: 57.59 | 7: iteration 1240/ 44073 | consumed samples: 634880 | consumed tokens: 1300234240 | elapsed time per iteration (s): 4.15 | learning rate: 1.999E-04 | global batch size: 512 | lm loss: 3.083501E+00 | grad norm: 1.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 1250/ 44073 | consumed samples: 640000 | consumed tokens: 1310720000 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.329999E+00 | grad norm: 2.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.362 | TFLOPs: 57.49 | 7: iteration 1260/ 44073 | consumed samples: 645120 | consumed tokens: 1321205760 | elapsed time per iteration (s): 4.20 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.343958E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.882 | TFLOPs: 56.80 | 7: iteration 1270/ 44073 | consumed samples: 650240 | consumed tokens: 1331691520 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.246570E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.468 | TFLOPs: 57.54 | 7: iteration 1280/ 44073 | consumed samples: 655360 | consumed tokens: 1342177280 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.228189E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.355 | TFLOPs: 57.49 | 7: iteration 1290/ 44073 | consumed samples: 660480 | consumed tokens: 1352663040 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.215653E+00 | grad norm: 
1.039 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.480 | TFLOPs: 57.55 | 7: iteration 1300/ 44073 | consumed samples: 665600 | consumed tokens: 1363148800 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.142997E+00 | grad norm: 0.589 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.612 | TFLOPs: 57.61 | 7: iteration 1310/ 44073 | consumed samples: 670720 | consumed tokens: 1373634560 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 3.040951E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.559 | TFLOPs: 57.58 | 7: iteration 1320/ 44073 | consumed samples: 675840 | consumed tokens: 1384120320 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.981597E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.676 | TFLOPs: 57.64 | 7: iteration 1330/ 44073 | consumed samples: 680960 | consumed tokens: 1394606080 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.922380E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.669 | TFLOPs: 57.64 | 7: iteration 1340/ 44073 | consumed samples: 686080 | consumed tokens: 1405091840 | elapsed time per iteration (s): 4.16 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.877888E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.193 | TFLOPs: 57.41 | 7: iteration 1350/ 44073 | consumed samples: 691200 | consumed tokens: 1415577600 | elapsed time per iteration (s): 4.14 | learning rate: 
1.998E-04 | global batch size: 512 | lm loss: 2.877750E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.591 | TFLOPs: 57.60 | 7: iteration 1360/ 44073 | consumed samples: 696320 | consumed tokens: 1426063360 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.864160E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.335 | TFLOPs: 57.48 | 7: iteration 1370/ 44073 | consumed samples: 701440 | consumed tokens: 1436549120 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.832541E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.314 | TFLOPs: 57.47 | 7: iteration 1380/ 44073 | consumed samples: 706560 | consumed tokens: 1447034880 | elapsed time per iteration (s): 4.17 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.831197E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.680 | TFLOPs: 57.18 | 7: iteration 1390/ 44073 | consumed samples: 711680 | consumed tokens: 1457520640 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.849655E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.700 | TFLOPs: 57.65 | 7: iteration 1400/ 44073 | consumed samples: 716800 | consumed tokens: 1468006400 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.863799E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.645 | TFLOPs: 57.62 | 7: iteration 1410/ 44073 | consumed samples: 721920 | consumed 
tokens: 1478492160 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.826973E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.603 | TFLOPs: 57.61 | 7: iteration 1420/ 44073 | consumed samples: 727040 | consumed tokens: 1488977920 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.829591E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.680 | TFLOPs: 57.64 | 7: iteration 1430/ 44073 | consumed samples: 732160 | consumed tokens: 1499463680 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.822718E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.283 | TFLOPs: 57.46 | 7: iteration 1440/ 44073 | consumed samples: 737280 | consumed tokens: 1509949440 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.819917E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.634 | TFLOPs: 57.62 | 7: iteration 1450/ 44073 | consumed samples: 742400 | consumed tokens: 1520435200 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.796312E+00 | grad norm: 0.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.483 | TFLOPs: 57.55 | 7: iteration 1460/ 44073 | consumed samples: 747520 | consumed tokens: 1530920960 | elapsed time per iteration (s): 4.15 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.785123E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.474 | TFLOPs: 
57.55 | 7: iteration 1470/ 44073 | consumed samples: 752640 | consumed tokens: 1541406720 | elapsed time per iteration (s): 4.14 | learning rate: 1.998E-04 | global batch size: 512 | lm loss: 2.763860E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.713 | TFLOPs: 57.66 | 7: iteration 1480/ 44073 | consumed samples: 757760 | consumed tokens: 1551892480 | elapsed time per iteration (s): 4.15 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.785118E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.374 | TFLOPs: 57.50 | 7: iteration 1490/ 44073 | consumed samples: 762880 | consumed tokens: 1562378240 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.803974E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.734 | TFLOPs: 57.67 | 7: iteration 1500/ 44073 | consumed samples: 768000 | consumed tokens: 1572864000 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.778476E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.752 | TFLOPs: 57.67 | 7: iteration 1510/ 44073 | consumed samples: 773120 | consumed tokens: 1583349760 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.766838E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.709 | TFLOPs: 57.65 | 7: iteration 1520/ 44073 | consumed samples: 778240 | consumed tokens: 1593835520 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.751607E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | samples per second: 123.692 | TFLOPs: 57.65 | 7: iteration 1530/ 44073 | consumed samples: 783360 | consumed tokens: 1604321280 | elapsed time per iteration (s): 4.17 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.783700E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.808 | TFLOPs: 57.23 | 7: iteration 1540/ 44073 | consumed samples: 788480 | consumed tokens: 1614807040 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.775693E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.715 | TFLOPs: 57.66 | 7: iteration 1550/ 44073 | consumed samples: 793600 | consumed tokens: 1625292800 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.773174E+00 | grad norm: 0.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.621 | TFLOPs: 57.61 | 7: iteration 1560/ 44073 | consumed samples: 798720 | consumed tokens: 1635778560 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.742114E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.663 | TFLOPs: 57.63 | 7: iteration 1570/ 44073 | consumed samples: 803840 | consumed tokens: 1646264320 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.739991E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.717 | TFLOPs: 57.66 | 7: iteration 1580/ 44073 | consumed samples: 808960 | consumed tokens: 1656750080 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.739775E+00 
| grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.703 | TFLOPs: 57.65 | 7: iteration 1590/ 44073 | consumed samples: 814080 | consumed tokens: 1667235840 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.739748E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.704 | TFLOPs: 57.65 | 7: iteration 1600/ 44073 | consumed samples: 819200 | consumed tokens: 1677721600 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.707207E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.606 | TFLOPs: 57.61 | 7: iteration 1610/ 44073 | consumed samples: 824320 | consumed tokens: 1688207360 | elapsed time per iteration (s): 4.15 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.743061E+00 | grad norm: 0.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.434 | TFLOPs: 57.53 | 7: iteration 1620/ 44073 | consumed samples: 829440 | consumed tokens: 1698693120 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.734517E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.690 | TFLOPs: 57.65 | 7: iteration 1630/ 44073 | consumed samples: 834560 | consumed tokens: 1709178880 | elapsed time per iteration (s): 4.14 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.823770E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.563 | TFLOPs: 57.59 | 7: iteration 1640/ 44073 | consumed samples: 839680 | consumed tokens: 1719664640 | elapsed time per iteration (s): 4.15 | 
learning rate: 1.997E-04 | global batch size: 512 | lm loss: 2.997407E+00 | grad norm: 1.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.385 | TFLOPs: 57.50 | 7: iteration 1650/ 44073 | consumed samples: 844800 | consumed tokens: 1730150400 | elapsed time per iteration (s): 4.15 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 3.078922E+00 | grad norm: 1.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 1660/ 44073 | consumed samples: 849920 | consumed tokens: 1740636160 | elapsed time per iteration (s): 4.15 | learning rate: 1.997E-04 | global batch size: 512 | lm loss: 3.014038E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.429 | TFLOPs: 57.52 | 7: iteration 1670/ 44073 | consumed samples: 855040 | consumed tokens: 1751121920 | elapsed time per iteration (s): 4.18 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.931498E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.520 | TFLOPs: 57.10 | 7: iteration 1680/ 44073 | consumed samples: 860160 | consumed tokens: 1761607680 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.829433E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.703 | TFLOPs: 57.65 | 7: iteration 1690/ 44073 | consumed samples: 865280 | consumed tokens: 1772093440 | elapsed time per iteration (s): 4.17 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.800837E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.755 | TFLOPs: 57.21 | 7: iteration 1700/ 44073 | consumed samples: 870400 
| consumed tokens: 1782579200 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.754030E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.680 | TFLOPs: 57.64 | 7: iteration 1710/ 44073 | consumed samples: 875520 | consumed tokens: 1793064960 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.784577E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.682 | TFLOPs: 57.64 | 7: iteration 1720/ 44073 | consumed samples: 880640 | consumed tokens: 1803550720 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.710626E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.631 | TFLOPs: 57.62 | 7: iteration 1730/ 44073 | consumed samples: 885760 | consumed tokens: 1814036480 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.742508E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.696 | TFLOPs: 57.65 | 7: iteration 1740/ 44073 | consumed samples: 890880 | consumed tokens: 1824522240 | elapsed time per iteration (s): 4.16 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.746787E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.975 | TFLOPs: 57.31 | 7: iteration 1750/ 44073 | consumed samples: 896000 | consumed tokens: 1835008000 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.707799E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.667 
| TFLOPs: 57.63 | 7: iteration 1760/ 44073 | consumed samples: 901120 | consumed tokens: 1845493760 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.708617E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.657 | TFLOPs: 57.63 | 7: iteration 1770/ 44073 | consumed samples: 906240 | consumed tokens: 1855979520 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.716687E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.660 | TFLOPs: 57.63 | 7: iteration 1780/ 44073 | consumed samples: 911360 | consumed tokens: 1866465280 | elapsed time per iteration (s): 4.16 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.742217E+00 | grad norm: 0.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.200 | TFLOPs: 57.42 | 7: iteration 1790/ 44073 | consumed samples: 916480 | consumed tokens: 1876951040 | elapsed time per iteration (s): 4.14 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.738997E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.682 | TFLOPs: 57.64 | 7: iteration 1800/ 44073 | consumed samples: 921600 | consumed tokens: 1887436800 | elapsed time per iteration (s): 4.15 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.696916E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.378 | TFLOPs: 57.50 | 7: iteration 1810/ 44073 | consumed samples: 926720 | consumed tokens: 1897922560 | elapsed time per iteration (s): 4.49 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.703772E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 113.924 | TFLOPs: 53.09 | 7: iteration 1820/ 44073 | consumed samples: 931840 | consumed tokens: 1908408320 | elapsed time per iteration (s): 4.18 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.718303E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.606 | TFLOPs: 57.14 | 7: iteration 1830/ 44073 | consumed samples: 936960 | consumed tokens: 1918894080 | elapsed time per iteration (s): 4.19 | learning rate: 1.996E-04 | global batch size: 512 | lm loss: 2.681231E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.238 | TFLOPs: 56.97 | 7: iteration 1840/ 44073 | consumed samples: 942080 | consumed tokens: 1929379840 | elapsed time per iteration (s): 4.22 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.694712E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.423 | TFLOPs: 56.59 | 7: iteration 1850/ 44073 | consumed samples: 947200 | consumed tokens: 1939865600 | elapsed time per iteration (s): 4.18 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.696009E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.614 | TFLOPs: 57.14 | 7: iteration 1860/ 44073 | consumed samples: 952320 | consumed tokens: 1950351360 | elapsed time per iteration (s): 4.36 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.695874E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.494 | TFLOPs: 54.76 | 7: iteration 1870/ 44073 | consumed samples: 957440 | consumed tokens: 1960837120 | elapsed time per iteration (s): 4.16 | learning rate: 1.995E-04 | global batch size: 512 | lm 
loss: 2.681295E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.935 | TFLOPs: 57.29 | 7: iteration 1880/ 44073 | consumed samples: 962560 | consumed tokens: 1971322880 | elapsed time per iteration (s): 4.29 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.678469E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.385 | TFLOPs: 55.64 | 7: iteration 1890/ 44073 | consumed samples: 967680 | consumed tokens: 1981808640 | elapsed time per iteration (s): 4.16 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.676262E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.214 | TFLOPs: 57.42 | 7: iteration 1900/ 44073 | consumed samples: 972800 | consumed tokens: 1992294400 | elapsed time per iteration (s): 4.17 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.683139E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.820 | TFLOPs: 57.24 | 7: iteration 1910/ 44073 | consumed samples: 977920 | consumed tokens: 2002780160 | elapsed time per iteration (s): 4.18 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.677628E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.628 | TFLOPs: 57.15 | 7: iteration 1920/ 44073 | consumed samples: 983040 | consumed tokens: 2013265920 | elapsed time per iteration (s): 4.17 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.682176E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.638 | TFLOPs: 57.16 | 7: iteration 1930/ 44073 | consumed samples: 988160 | consumed tokens: 2023751680 | elapsed time per 
iteration (s): 4.15 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.647984E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.274 | TFLOPs: 57.45 | 7: iteration 1940/ 44073 | consumed samples: 993280 | consumed tokens: 2034237440 | elapsed time per iteration (s): 4.17 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.652192E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.719 | TFLOPs: 57.19 | 7: iteration 1950/ 44073 | consumed samples: 998400 | consumed tokens: 2044723200 | elapsed time per iteration (s): 4.47 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.735960E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.420 | TFLOPs: 53.33 | 7: iteration 1960/ 44073 | consumed samples: 1003520 | consumed tokens: 2055208960 | elapsed time per iteration (s): 4.18 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.742381E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.377 | TFLOPs: 57.03 | 7: iteration 1970/ 44073 | consumed samples: 1008640 | consumed tokens: 2065694720 | elapsed time per iteration (s): 4.19 | learning rate: 1.995E-04 | global batch size: 512 | lm loss: 2.774289E+00 | grad norm: 0.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.227 | TFLOPs: 56.96 | 7: iteration 1980/ 44073 | consumed samples: 1013760 | consumed tokens: 2076180480 | elapsed time per iteration (s): 4.19 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.694393E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.090 | TFLOPs: 56.90 | 7: iteration 1990/ 44073 | 
consumed samples: 1018880 | consumed tokens: 2086666240 | elapsed time per iteration (s): 4.18 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.703384E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.487 | TFLOPs: 57.09 | 0: [2022-11-25 12:17:38,591] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.00019943341201512647, 0.00019943341201512647, 0.00019943341201512647], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 2000/ 44073 | consumed samples: 1024000 | consumed tokens: 2097152000 | elapsed time per iteration (s): 4.18 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.671627E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.416 | TFLOPs: 57.05 | 0: steps: 2000 loss: 2.6860 iter time (s): 4.167 samples/sec: 122.864 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 2000 | lm loss value: 2.678066E+00 | lm loss PPL: 1.455691E+01 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 2000 to checkpoints_2b2 0: [2022-11-25 12:17:40,020] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is begin to save! 0: [2022-11-25 12:17:40,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_01-model_00-model_states.pt... 0: [2022-11-25 12:17:40,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_01-model_00-model_states.pt. 0: [2022-11-25 12:17:40,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 12:17:40,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_03-model_00-model_states.pt. 0: [2022-11-25 12:17:40,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_04-model_00-model_states.pt... 0: [2022-11-25 12:17:40,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_04-model_00-model_states.pt. 0: [2022-11-25 12:17:40,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_05-model_00-model_states.pt... 0: [2022-11-25 12:17:40,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_05-model_00-model_states.pt. 0: [2022-11-25 12:17:40,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_06-model_00-model_states.pt... 0: [2022-11-25 12:17:41,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_06-model_00-model_states.pt. 0: [2022-11-25 12:17:41,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_07-model_00-model_states.pt... 0: [2022-11-25 12:17:41,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_07-model_00-model_states.pt. 0: [2022-11-25 12:17:41,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_08-model_00-model_states.pt... 0: [2022-11-25 12:17:41,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_08-model_00-model_states.pt. 0: [2022-11-25 12:17:41,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 12:17:41,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_09-model_00-model_states.pt. 0: [2022-11-25 12:17:41,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_10-model_00-model_states.pt... 0: [2022-11-25 12:17:41,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_10-model_00-model_states.pt. 0: [2022-11-25 12:17:41,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_11-model_00-model_states.pt... 0: [2022-11-25 12:17:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_11-model_00-model_states.pt. 0: [2022-11-25 12:17:41,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_12-model_00-model_states.pt... 0: [2022-11-25 12:17:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_12-model_00-model_states.pt. 0: [2022-11-25 12:17:41,910] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_13-model_00-model_states.pt... 0: [2022-11-25 12:17:42,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_13-model_00-model_states.pt. 0: [2022-11-25 12:17:42,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_14-model_00-model_states.pt... 0: [2022-11-25 12:17:42,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_14-model_00-model_states.pt. 0: [2022-11-25 12:17:42,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 12:17:42,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_15-model_00-model_states.pt. 0: [2022-11-25 12:17:42,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_16-model_00-model_states.pt... 0: [2022-11-25 12:17:42,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_16-model_00-model_states.pt. 0: [2022-11-25 12:17:42,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_17-model_00-model_states.pt... 0: [2022-11-25 12:17:42,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_17-model_00-model_states.pt. 0: [2022-11-25 12:17:42,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_18-model_00-model_states.pt... 0: [2022-11-25 12:17:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_18-model_00-model_states.pt. 0: [2022-11-25 12:17:42,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_19-model_00-model_states.pt... 0: [2022-11-25 12:17:42,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_19-model_00-model_states.pt. 0: [2022-11-25 12:17:42,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_20-model_00-model_states.pt... 0: [2022-11-25 12:17:43,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_20-model_00-model_states.pt. 0: [2022-11-25 12:17:43,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 12:17:43,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_21-model_00-model_states.pt. 0: [2022-11-25 12:17:43,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_22-model_00-model_states.pt... 0: [2022-11-25 12:17:43,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_22-model_00-model_states.pt. 0: [2022-11-25 12:17:43,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_23-model_00-model_states.pt... 0: [2022-11-25 12:17:43,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_23-model_00-model_states.pt. 0: [2022-11-25 12:17:43,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_24-model_00-model_states.pt... 0: [2022-11-25 12:17:43,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_24-model_00-model_states.pt. 0: [2022-11-25 12:17:43,569] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_25-model_00-model_states.pt... 0: [2022-11-25 12:17:43,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_25-model_00-model_states.pt. 0: [2022-11-25 12:17:43,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_26-model_00-model_states.pt... 0: [2022-11-25 12:17:43,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_26-model_00-model_states.pt. 0: [2022-11-25 12:17:43,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 12:17:43,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_27-model_00-model_states.pt. 0: [2022-11-25 12:17:43,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_28-model_00-model_states.pt... 0: [2022-11-25 12:17:44,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_28-model_00-model_states.pt. 0: [2022-11-25 12:17:44,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_29-model_00-model_states.pt... 0: [2022-11-25 12:17:44,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_29-model_00-model_states.pt. 0: [2022-11-25 12:17:44,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_30-model_00-model_states.pt... 0: [2022-11-25 12:17:44,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_30-model_00-model_states.pt. 0: [2022-11-25 12:17:44,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_31-model_00-model_states.pt... 0: [2022-11-25 12:17:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_31-model_00-model_states.pt. 0: [2022-11-25 12:17:44,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_32-model_00-model_states.pt... 0: [2022-11-25 12:17:44,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_32-model_00-model_states.pt. 0: [2022-11-25 12:17:44,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_33-model_00-model_states.pt... 
0: [2022-11-25 12:17:44,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_33-model_00-model_states.pt. 0: [2022-11-25 12:17:44,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_34-model_00-model_states.pt... 0: [2022-11-25 12:17:44,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_34-model_00-model_states.pt. 0: [2022-11-25 12:17:44,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/layer_36-model_00-model_states.pt... 0: [2022-11-25 12:17:44,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/layer_36-model_00-model_states.pt. 0: [2022-11-25 12:17:44,925] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step2000/mp_rank_00_model_states.pt 0: [2022-11-25 12:17:44,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/mp_rank_00_model_states.pt... 0: [2022-11-25 12:17:44,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/mp_rank_00_model_states.pt. 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 7: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 1: [2022-11-25 12:17:44,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 0: [2022-11-25 12:17:45,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:45,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:45,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-25 12:17:45,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:45,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:45,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:45,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
0: [2022-11-25 12:17:45,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 12:17:45,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:45,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 12:17:45,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 12:17:45,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 6: [2022-11-25 12:17:45,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 12:17:45,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 12:17:46,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 12:17:46,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 12:17:46,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 12:17:46,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 12:17:46,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 12:17:46,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 12:17:46,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 12:17:46,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 12:17:46,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 12:17:46,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 12:17:46,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
0: successfully saved checkpoint at iteration 2000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6276.96 7: iteration 2010/ 44073 | consumed samples: 1029120 | consumed tokens: 2107637760 | elapsed time per iteration (s): 4.97 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.667940E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.106 | TFLOPs: 48.05 | 7: iteration 2020/ 44073 | consumed samples: 1034240 | consumed tokens: 2118123520 | elapsed time per iteration (s): 4.19 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.636174E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.130 | TFLOPs: 56.92 | 7: iteration 2030/ 44073 | consumed samples: 1039360 | consumed tokens: 2128609280 | elapsed time per iteration (s): 4.17 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.649468E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.925 | TFLOPs: 57.29 | 7: iteration 2040/ 44073 | consumed samples: 1044480 | consumed tokens: 2139095040 | elapsed time per iteration (s): 4.21 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.659915E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.690 | TFLOPs: 56.71 | 7: iteration 2050/ 44073 | consumed samples: 1049600 | consumed tokens: 2149580800 | elapsed time per iteration (s): 4.22 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.646695E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.404 | TFLOPs: 56.58 | 7: iteration 2060/ 44073 | consumed samples: 1054720 | consumed tokens: 2160066560 | elapsed time per iteration (s): 4.15 | learning rate: 1.994E-04 | global 
batch size: 512 | lm loss: 2.603301E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.247 | TFLOPs: 57.44 | 7: iteration 2070/ 44073 | consumed samples: 1059840 | consumed tokens: 2170552320 | elapsed time per iteration (s): 4.20 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.613984E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.822 | TFLOPs: 56.78 | 7: iteration 2080/ 44073 | consumed samples: 1064960 | consumed tokens: 2181038080 | elapsed time per iteration (s): 4.21 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.633107E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.667 | TFLOPs: 56.70 | 7: iteration 2090/ 44073 | consumed samples: 1070080 | consumed tokens: 2191523840 | elapsed time per iteration (s): 4.21 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.615896E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.537 | TFLOPs: 56.64 | 7: iteration 2100/ 44073 | consumed samples: 1075200 | consumed tokens: 2202009600 | elapsed time per iteration (s): 4.16 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.608874E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.066 | TFLOPs: 57.36 | 7: iteration 2110/ 44073 | consumed samples: 1080320 | consumed tokens: 2212495360 | elapsed time per iteration (s): 4.26 | learning rate: 1.994E-04 | global batch size: 512 | lm loss: 2.630955E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.297 | TFLOPs: 56.06 | 7: iteration 2120/ 44073 | consumed samples: 1085440 | consumed tokens: 
2222981120 | elapsed time per iteration (s): 4.29 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.618124E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.402 | TFLOPs: 55.65 | 7: iteration 2130/ 44073 | consumed samples: 1090560 | consumed tokens: 2233466880 | elapsed time per iteration (s): 4.19 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.610464E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.064 | TFLOPs: 56.89 | 7: iteration 2140/ 44073 | consumed samples: 1095680 | consumed tokens: 2243952640 | elapsed time per iteration (s): 4.21 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.599285E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.659 | TFLOPs: 56.70 | 7: iteration 2150/ 44073 | consumed samples: 1100800 | consumed tokens: 2254438400 | elapsed time per iteration (s): 4.24 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.611056E+00 | grad norm: 0.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.795 | TFLOPs: 56.30 | 7: iteration 2160/ 44073 | consumed samples: 1105920 | consumed tokens: 2264924160 | elapsed time per iteration (s): 4.25 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.608069E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.550 | TFLOPs: 56.18 | 7: iteration 2170/ 44073 | consumed samples: 1111040 | consumed tokens: 2275409920 | elapsed time per iteration (s): 4.19 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.611055E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.143 | TFLOPs: 
56.93 | 7: iteration 2180/ 44073 | consumed samples: 1116160 | consumed tokens: 2285895680 | elapsed time per iteration (s): 4.21 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.606837E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.717 | TFLOPs: 56.73 | 7: iteration 2190/ 44073 | consumed samples: 1121280 | consumed tokens: 2296381440 | elapsed time per iteration (s): 4.20 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.618109E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.002 | TFLOPs: 56.86 | 7: iteration 2200/ 44073 | consumed samples: 1126400 | consumed tokens: 2306867200 | elapsed time per iteration (s): 4.16 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.618779E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.992 | TFLOPs: 57.32 | 7: iteration 2210/ 44073 | consumed samples: 1131520 | consumed tokens: 2317352960 | elapsed time per iteration (s): 4.18 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.588359E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.586 | TFLOPs: 57.13 | 7: iteration 2220/ 44073 | consumed samples: 1136640 | consumed tokens: 2327838720 | elapsed time per iteration (s): 4.19 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.588308E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.100 | TFLOPs: 56.90 | 7: iteration 2230/ 44073 | consumed samples: 1141760 | consumed tokens: 2338324480 | elapsed time per iteration (s): 4.21 | learning rate: 1.993E-04 | global batch size: 512 | lm loss: 2.576404E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.749 | TFLOPs: 56.74 | 7: iteration 2240/ 44073 | consumed samples: 1146880 | consumed tokens: 2348810240 | elapsed time per iteration (s): 4.17 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.592848E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.861 | TFLOPs: 57.26 | 7: iteration 2250/ 44073 | consumed samples: 1152000 | consumed tokens: 2359296000 | elapsed time per iteration (s): 4.22 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.571815E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.253 | TFLOPs: 56.51 | 7: iteration 2260/ 44073 | consumed samples: 1157120 | consumed tokens: 2369781760 | elapsed time per iteration (s): 4.18 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.580780E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.536 | TFLOPs: 57.11 | 7: iteration 2270/ 44073 | consumed samples: 1162240 | consumed tokens: 2380267520 | elapsed time per iteration (s): 4.18 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.604621E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.627 | TFLOPs: 57.15 | 7: iteration 2280/ 44073 | consumed samples: 1167360 | consumed tokens: 2390753280 | elapsed time per iteration (s): 4.20 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.576210E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.777 | TFLOPs: 56.75 | 7: iteration 2290/ 44073 | consumed samples: 1172480 | consumed tokens: 2401239040 | elapsed time per iteration (s): 4.18 | learning rate: 1.992E-04 | global batch size: 512 | 
lm loss: 2.584446E+00 | grad norm: 0.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.553 | TFLOPs: 57.12 | 7: iteration 2300/ 44073 | consumed samples: 1177600 | consumed tokens: 2411724800 | elapsed time per iteration (s): 4.18 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.740653E+00 | grad norm: 3.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.475 | TFLOPs: 57.08 | 7: iteration 2310/ 44073 | consumed samples: 1182720 | consumed tokens: 2422210560 | elapsed time per iteration (s): 4.18 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 3.061937E+00 | grad norm: 1.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.391 | TFLOPs: 57.04 | 7: iteration 2320/ 44073 | consumed samples: 1187840 | consumed tokens: 2432696320 | elapsed time per iteration (s): 4.20 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.995409E+00 | grad norm: 1.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.926 | TFLOPs: 56.82 | 7: iteration 2330/ 44073 | consumed samples: 1192960 | consumed tokens: 2443182080 | elapsed time per iteration (s): 4.15 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.923487E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.241 | TFLOPs: 57.44 | 7: iteration 2340/ 44073 | consumed samples: 1198080 | consumed tokens: 2453667840 | elapsed time per iteration (s): 4.17 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.860063E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.833 | TFLOPs: 57.25 | 7: iteration 2350/ 44073 | consumed samples: 1203200 | consumed tokens: 2464153600 | elapsed time 
per iteration (s): 4.16 | learning rate: 1.992E-04 | global batch size: 512 | lm loss: 2.817703E+00 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.066 | TFLOPs: 57.35 | 7: iteration 2360/ 44073 | consumed samples: 1208320 | consumed tokens: 2474639360 | elapsed time per iteration (s): 4.17 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.696893E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.820 | TFLOPs: 57.24 | 7: iteration 2370/ 44073 | consumed samples: 1213440 | consumed tokens: 2485125120 | elapsed time per iteration (s): 4.14 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.650282E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.528 | TFLOPs: 57.57 | 7: iteration 2380/ 44073 | consumed samples: 1218560 | consumed tokens: 2495610880 | elapsed time per iteration (s): 4.16 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.635935E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.964 | TFLOPs: 57.31 | 7: iteration 2390/ 44073 | consumed samples: 1223680 | consumed tokens: 2506096640 | elapsed time per iteration (s): 4.19 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.630346E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.252 | TFLOPs: 56.98 | 7: iteration 2400/ 44073 | consumed samples: 1228800 | consumed tokens: 2516582400 | elapsed time per iteration (s): 4.15 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.629663E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.238 | TFLOPs: 57.43 | 7: iteration 2410/ 
44073 | consumed samples: 1233920 | consumed tokens: 2527068160 | elapsed time per iteration (s): 4.18 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.608670E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.602 | TFLOPs: 57.14 | 7: iteration 2420/ 44073 | consumed samples: 1239040 | consumed tokens: 2537553920 | elapsed time per iteration (s): 4.15 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.573076E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.503 | TFLOPs: 57.56 | 7: iteration 2430/ 44073 | consumed samples: 1244160 | consumed tokens: 2548039680 | elapsed time per iteration (s): 4.17 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.608138E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.639 | TFLOPs: 57.16 | 7: iteration 2440/ 44073 | consumed samples: 1249280 | consumed tokens: 2558525440 | elapsed time per iteration (s): 4.16 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.601952E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.938 | TFLOPs: 57.30 | 7: iteration 2450/ 44073 | consumed samples: 1254400 | consumed tokens: 2569011200 | elapsed time per iteration (s): 4.20 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.587174E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.975 | TFLOPs: 56.85 | 7: iteration 2460/ 44073 | consumed samples: 1259520 | consumed tokens: 2579496960 | elapsed time per iteration (s): 4.16 | learning rate: 1.991E-04 | global batch size: 512 | lm loss: 2.580599E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.221 | TFLOPs: 57.43 | 7: iteration 2470/ 44073 | consumed samples: 1264640 | consumed tokens: 2589982720 | elapsed time per iteration (s): 4.17 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.582730E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.666 | TFLOPs: 57.17 | 7: iteration 2480/ 44073 | consumed samples: 1269760 | consumed tokens: 2600468480 | elapsed time per iteration (s): 4.18 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.592010E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.352 | TFLOPs: 57.02 | 7: iteration 2490/ 44073 | consumed samples: 1274880 | consumed tokens: 2610954240 | elapsed time per iteration (s): 4.15 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.576089E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.486 | TFLOPs: 57.55 | 7: iteration 2500/ 44073 | consumed samples: 1280000 | consumed tokens: 2621440000 | elapsed time per iteration (s): 4.16 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.556760E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.082 | TFLOPs: 57.36 | 7: iteration 2510/ 44073 | consumed samples: 1285120 | consumed tokens: 2631925760 | elapsed time per iteration (s): 4.20 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.551686E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.970 | TFLOPs: 56.84 | 7: iteration 2520/ 44073 | consumed samples: 1290240 | consumed tokens: 2642411520 | elapsed time per iteration (s): 4.17 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.545279E+00 | grad 
norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.772 | TFLOPs: 57.22 | 7: iteration 2530/ 44073 | consumed samples: 1295360 | consumed tokens: 2652897280 | elapsed time per iteration (s): 4.17 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.531496E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.659 | TFLOPs: 57.17 | 7: iteration 2540/ 44073 | consumed samples: 1300480 | consumed tokens: 2663383040 | elapsed time per iteration (s): 4.16 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.564799E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.117 | TFLOPs: 57.38 | 7: iteration 2550/ 44073 | consumed samples: 1305600 | consumed tokens: 2673868800 | elapsed time per iteration (s): 4.16 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.568092E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.975 | TFLOPs: 57.31 | 7: iteration 2560/ 44073 | consumed samples: 1310720 | consumed tokens: 2684354560 | elapsed time per iteration (s): 4.19 | learning rate: 1.990E-04 | global batch size: 512 | lm loss: 2.564003E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.304 | TFLOPs: 57.00 | 7: iteration 2570/ 44073 | consumed samples: 1315840 | consumed tokens: 2694840320 | elapsed time per iteration (s): 4.17 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.563660E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.702 | TFLOPs: 57.19 | 7: iteration 2580/ 44073 | consumed samples: 1320960 | consumed tokens: 2705326080 | elapsed time per iteration (s): 4.20 | 
learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.568767E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.972 | TFLOPs: 56.85 | 7: iteration 2590/ 44073 | consumed samples: 1326080 | consumed tokens: 2715811840 | elapsed time per iteration (s): 4.21 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.551929E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.609 | TFLOPs: 56.68 | 7: iteration 2600/ 44073 | consumed samples: 1331200 | consumed tokens: 2726297600 | elapsed time per iteration (s): 4.17 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.538818E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.826 | TFLOPs: 57.24 | 7: iteration 2610/ 44073 | consumed samples: 1336320 | consumed tokens: 2736783360 | elapsed time per iteration (s): 4.20 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.528184E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.776 | TFLOPs: 56.75 | 7: iteration 2620/ 44073 | consumed samples: 1341440 | consumed tokens: 2747269120 | elapsed time per iteration (s): 4.21 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.546983E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.757 | TFLOPs: 56.74 | 7: iteration 2630/ 44073 | consumed samples: 1346560 | consumed tokens: 2757754880 | elapsed time per iteration (s): 4.16 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.555309E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.035 | TFLOPs: 57.34 | 7: iteration 2640/ 44073 | consumed samples: 
1351680 | consumed tokens: 2768240640 | elapsed time per iteration (s): 4.18 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.528287E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.506 | TFLOPs: 57.09 | 7: iteration 2650/ 44073 | consumed samples: 1356800 | consumed tokens: 2778726400 | elapsed time per iteration (s): 4.16 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.520861E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.955 | TFLOPs: 57.30 | 7: iteration 2660/ 44073 | consumed samples: 1361920 | consumed tokens: 2789212160 | elapsed time per iteration (s): 4.18 | learning rate: 1.989E-04 | global batch size: 512 | lm loss: 2.545004E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.450 | TFLOPs: 57.07 | 7: iteration 2670/ 44073 | consumed samples: 1367040 | consumed tokens: 2799697920 | elapsed time per iteration (s): 4.20 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.546003E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.774 | TFLOPs: 56.75 | 7: iteration 2680/ 44073 | consumed samples: 1372160 | consumed tokens: 2810183680 | elapsed time per iteration (s): 4.16 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.533483E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.941 | TFLOPs: 57.30 | 7: iteration 2690/ 44073 | consumed samples: 1377280 | consumed tokens: 2820669440 | elapsed time per iteration (s): 4.20 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.523167E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 121.992 | TFLOPs: 56.85 | 7: iteration 2700/ 44073 | consumed samples: 1382400 | consumed tokens: 2831155200 | elapsed time per iteration (s): 4.16 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.519511E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.013 | TFLOPs: 57.33 | 7: iteration 2710/ 44073 | consumed samples: 1387520 | consumed tokens: 2841640960 | elapsed time per iteration (s): 4.20 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.528016E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.023 | TFLOPs: 56.87 | 7: iteration 2720/ 44073 | consumed samples: 1392640 | consumed tokens: 2852126720 | elapsed time per iteration (s): 4.16 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.535865E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.058 | TFLOPs: 57.35 | 7: iteration 2730/ 44073 | consumed samples: 1397760 | consumed tokens: 2862612480 | elapsed time per iteration (s): 4.19 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.539503E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.296 | TFLOPs: 57.00 | 7: iteration 2740/ 44073 | consumed samples: 1402880 | consumed tokens: 2873098240 | elapsed time per iteration (s): 4.31 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.526006E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.843 | TFLOPs: 55.39 | 7: iteration 2750/ 44073 | consumed samples: 1408000 | consumed tokens: 2883584000 | elapsed time per iteration (s): 4.16 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 2.527079E+00 | grad norm: 0.176 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.081 | TFLOPs: 57.36 | 7: iteration 2760/ 44073 | consumed samples: 1413120 | consumed tokens: 2894069760 | elapsed time per iteration (s): 4.15 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.536926E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.360 | TFLOPs: 57.49 | 7: iteration 2770/ 44073 | consumed samples: 1418240 | consumed tokens: 2904555520 | elapsed time per iteration (s): 4.17 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.539932E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.715 | TFLOPs: 57.19 | 7: iteration 2780/ 44073 | consumed samples: 1423360 | consumed tokens: 2915041280 | elapsed time per iteration (s): 4.18 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.522013E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.503 | TFLOPs: 57.09 | 7: iteration 2790/ 44073 | consumed samples: 1428480 | consumed tokens: 2925527040 | elapsed time per iteration (s): 4.19 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.523648E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.296 | TFLOPs: 57.00 | 7: iteration 2800/ 44073 | consumed samples: 1433600 | consumed tokens: 2936012800 | elapsed time per iteration (s): 4.22 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.508442E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.319 | TFLOPs: 56.54 | 7: iteration 2810/ 44073 | consumed samples: 1438720 | consumed tokens: 2946498560 | elapsed time per iteration (s): 4.14 | learning rate: 1.987E-04 | global 
batch size: 512 | lm loss: 2.529320E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.620 | TFLOPs: 57.61 | 7: iteration 2820/ 44073 | consumed samples: 1443840 | consumed tokens: 2956984320 | elapsed time per iteration (s): 4.14 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.511625E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.523 | TFLOPs: 57.57 | 7: iteration 2830/ 44073 | consumed samples: 1448960 | consumed tokens: 2967470080 | elapsed time per iteration (s): 4.14 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.513475E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.555 | TFLOPs: 57.58 | 7: iteration 2840/ 44073 | consumed samples: 1454080 | consumed tokens: 2977955840 | elapsed time per iteration (s): 4.15 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 2.529021E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.486 | TFLOPs: 57.55 | 7: iteration 2850/ 44073 | consumed samples: 1459200 | consumed tokens: 2988441600 | elapsed time per iteration (s): 4.15 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.548199E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.411 | TFLOPs: 57.52 | 7: iteration 2860/ 44073 | consumed samples: 1464320 | consumed tokens: 2998927360 | elapsed time per iteration (s): 4.14 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.497098E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.573 | TFLOPs: 57.59 | 7: iteration 2870/ 44073 | consumed samples: 1469440 | consumed tokens: 
3009413120 | elapsed time per iteration (s): 4.15 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.519225E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.480 | TFLOPs: 57.55 | 7: iteration 2880/ 44073 | consumed samples: 1474560 | consumed tokens: 3019898880 | elapsed time per iteration (s): 4.31 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.483479E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.728 | TFLOPs: 55.33 | 7: iteration 2890/ 44073 | consumed samples: 1479680 | consumed tokens: 3030384640 | elapsed time per iteration (s): 4.16 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.495645E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.946 | TFLOPs: 57.30 | 7: iteration 2900/ 44073 | consumed samples: 1484800 | consumed tokens: 3040870400 | elapsed time per iteration (s): 4.17 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.515503E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.723 | TFLOPs: 57.19 | 7: iteration 2910/ 44073 | consumed samples: 1489920 | consumed tokens: 3051356160 | elapsed time per iteration (s): 4.17 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.507381E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.838 | TFLOPs: 57.25 | 7: iteration 2920/ 44073 | consumed samples: 1495040 | consumed tokens: 3061841920 | elapsed time per iteration (s): 4.16 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.498160E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.981 | TFLOPs: 
57.32 | 7: iteration 2930/ 44073 | consumed samples: 1500160 | consumed tokens: 3072327680 | elapsed time per iteration (s): 4.16 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 2.508227E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.971 | TFLOPs: 57.31 | 7: iteration 2940/ 44073 | consumed samples: 1505280 | consumed tokens: 3082813440 | elapsed time per iteration (s): 4.15 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.503542E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.406 | TFLOPs: 57.51 | 7: iteration 2950/ 44073 | consumed samples: 1510400 | consumed tokens: 3093299200 | elapsed time per iteration (s): 4.47 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.469196E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.553 | TFLOPs: 53.39 | 7: iteration 2960/ 44073 | consumed samples: 1515520 | consumed tokens: 3103784960 | elapsed time per iteration (s): 4.15 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.502155E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.427 | TFLOPs: 57.52 | 7: iteration 2970/ 44073 | consumed samples: 1520640 | consumed tokens: 3114270720 | elapsed time per iteration (s): 4.14 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.493769E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.528 | TFLOPs: 57.57 | 7: iteration 2980/ 44073 | consumed samples: 1525760 | consumed tokens: 3124756480 | elapsed time per iteration (s): 4.16 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.502287E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.021 | TFLOPs: 57.33 |
7: iteration 2990/ 44073 | consumed samples: 1530880 | consumed tokens: 3135242240 | elapsed time per iteration (s): 4.15 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.520154E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.262 | TFLOPs: 57.45 |
7: iteration 3000/ 44073 | consumed samples: 1536000 | consumed tokens: 3145728000 | elapsed time per iteration (s): 4.17 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.500864E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.871 | TFLOPs: 57.26 |
7: ------------------------------------------------------------------------------------------
7: valid loss at iteration 3000 | lm loss value: 2.518705E+00 | lm loss PPL: 1.241252E+01 |
7: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 3000 to checkpoints_2b2
0: [2022-11-25 13:27:32,152] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step3000 is begin to save!
0: [2022-11-25 13:27:32,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_01-model_00-model_states.pt...
0: [2022-11-25 13:27:32,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_01-model_00-model_states.pt.
0: [2022-11-25 13:27:32,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_03-model_00-model_states.pt...
0: [2022-11-25 13:27:32,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_03-model_00-model_states.pt.
0: [2022-11-25 13:27:32,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_04-model_00-model_states.pt...
0: [2022-11-25 13:27:32,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_04-model_00-model_states.pt.
0: [2022-11-25 13:27:32,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_05-model_00-model_states.pt...
0: [2022-11-25 13:27:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_05-model_00-model_states.pt.
0: [2022-11-25 13:27:33,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_06-model_00-model_states.pt...
0: [2022-11-25 13:27:33,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_06-model_00-model_states.pt.
0: [2022-11-25 13:27:33,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_07-model_00-model_states.pt...
0: [2022-11-25 13:27:33,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_07-model_00-model_states.pt.
0: [2022-11-25 13:27:33,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_08-model_00-model_states.pt...
0: [2022-11-25 13:27:33,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_08-model_00-model_states.pt.
0: [2022-11-25 13:27:33,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_09-model_00-model_states.pt...
0: [2022-11-25 13:27:33,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_09-model_00-model_states.pt.
0: [2022-11-25 13:27:33,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_10-model_00-model_states.pt...
0: [2022-11-25 13:27:33,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_10-model_00-model_states.pt.
0: [2022-11-25 13:27:33,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_11-model_00-model_states.pt...
0: [2022-11-25 13:27:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_11-model_00-model_states.pt.
0: [2022-11-25 13:27:33,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_12-model_00-model_states.pt...
0: [2022-11-25 13:27:34,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_12-model_00-model_states.pt.
0: [2022-11-25 13:27:34,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_13-model_00-model_states.pt...
0: [2022-11-25 13:27:34,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_13-model_00-model_states.pt.
0: [2022-11-25 13:27:34,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_14-model_00-model_states.pt...
0: [2022-11-25 13:27:34,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_14-model_00-model_states.pt.
0: [2022-11-25 13:27:34,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_15-model_00-model_states.pt...
0: [2022-11-25 13:27:34,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_15-model_00-model_states.pt.
0: [2022-11-25 13:27:34,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_16-model_00-model_states.pt... 0: [2022-11-25 13:27:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_16-model_00-model_states.pt. 0: [2022-11-25 13:27:34,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_17-model_00-model_states.pt... 0: [2022-11-25 13:27:34,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_17-model_00-model_states.pt. 0: [2022-11-25 13:27:34,782] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_18-model_00-model_states.pt... 0: [2022-11-25 13:27:34,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_18-model_00-model_states.pt. 0: [2022-11-25 13:27:34,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_19-model_00-model_states.pt... 0: [2022-11-25 13:27:35,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_19-model_00-model_states.pt. 0: [2022-11-25 13:27:35,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_20-model_00-model_states.pt... 0: [2022-11-25 13:27:35,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_20-model_00-model_states.pt. 0: [2022-11-25 13:27:35,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_21-model_00-model_states.pt... 0: [2022-11-25 13:27:35,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_21-model_00-model_states.pt. 
0: [2022-11-25 13:27:35,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_22-model_00-model_states.pt... 0: [2022-11-25 13:27:35,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_22-model_00-model_states.pt. 0: [2022-11-25 13:27:35,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_23-model_00-model_states.pt... 0: [2022-11-25 13:27:35,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_23-model_00-model_states.pt. 0: [2022-11-25 13:27:35,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_24-model_00-model_states.pt... 0: [2022-11-25 13:27:35,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_24-model_00-model_states.pt. 0: [2022-11-25 13:27:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_25-model_00-model_states.pt... 0: [2022-11-25 13:27:35,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_25-model_00-model_states.pt. 0: [2022-11-25 13:27:35,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_26-model_00-model_states.pt... 0: [2022-11-25 13:27:35,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_26-model_00-model_states.pt. 0: [2022-11-25 13:27:36,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_27-model_00-model_states.pt... 0: [2022-11-25 13:27:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_27-model_00-model_states.pt. 
0: [2022-11-25 13:27:36,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_28-model_00-model_states.pt... 0: [2022-11-25 13:27:36,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_28-model_00-model_states.pt. 0: [2022-11-25 13:27:36,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_29-model_00-model_states.pt... 0: [2022-11-25 13:27:36,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_29-model_00-model_states.pt. 0: [2022-11-25 13:27:36,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_30-model_00-model_states.pt... 0: [2022-11-25 13:27:36,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_30-model_00-model_states.pt. 0: [2022-11-25 13:27:36,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_31-model_00-model_states.pt... 0: [2022-11-25 13:27:36,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_31-model_00-model_states.pt. 0: [2022-11-25 13:27:36,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_32-model_00-model_states.pt... 0: [2022-11-25 13:27:36,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_32-model_00-model_states.pt. 0: [2022-11-25 13:27:36,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_33-model_00-model_states.pt... 0: [2022-11-25 13:27:36,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_33-model_00-model_states.pt. 
0: [2022-11-25 13:27:36,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_34-model_00-model_states.pt... 0: [2022-11-25 13:27:37,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_34-model_00-model_states.pt. 0: [2022-11-25 13:27:37,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/layer_36-model_00-model_states.pt... 0: [2022-11-25 13:27:37,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/layer_36-model_00-model_states.pt. 0: [2022-11-25 13:27:37,078] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step3000/mp_rank_00_model_states.pt 0: [2022-11-25 13:27:37,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/mp_rank_00_model_states.pt... 0: [2022-11-25 13:27:37,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/mp_rank_00_model_states.pt. 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 2: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 13:27:37,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:37,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 13:27:37,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 13:27:37,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 13:27:37,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
3: [2022-11-25 13:27:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
0: [2022-11-25 13:27:37,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 13:27:37,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:37,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
5: [2022-11-25 13:27:37,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 13:27:37,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 13:27:37,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 13:27:37,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:37,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 13:27:37,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
0: [2022-11-25 13:27:37,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 13:27:37,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:37,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:37,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:37,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:37,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:37,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 13:27:37,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 13:27:37,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 13:27:37,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 13:27:38,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 13:27:38,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 13:27:38,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 13:27:38,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:38,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:38,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:38,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 13:27:38,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 13:27:38,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 13:27:38,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 13:27:38,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 13:27:38,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 13:27:38,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 13:27:38,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 13:27:38,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 13:27:38,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 13:27:38,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 13:27:38,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 13:27:38,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
0: successfully saved checkpoint at iteration 3000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6369.28 7: iteration 3010/ 44073 | consumed samples: 1541120 | consumed tokens: 3156213760 | elapsed time per iteration (s): 4.94 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.501414E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.620 | TFLOPs: 48.29 | 7: iteration 3020/ 44073 | consumed samples: 1546240 | consumed tokens: 3166699520 | elapsed time per iteration (s): 4.19 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 2.500460E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.211 | TFLOPs: 56.96 | 7: iteration 3030/ 44073 | consumed samples: 1551360 | consumed tokens: 3177185280 | elapsed time per iteration (s): 4.18 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.472802E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.498 | TFLOPs: 57.09 | 7: iteration 3040/ 44073 | consumed samples: 1556480 | consumed tokens: 3187671040 | elapsed time per iteration (s): 4.17 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.508948E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.772 | TFLOPs: 57.22 | 7: iteration 3050/ 44073 | consumed samples: 1561600 | consumed tokens: 3198156800 | elapsed time per iteration (s): 4.18 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.493024E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.592 | TFLOPs: 57.13 | 7: iteration 3060/ 44073 | consumed samples: 1566720 | consumed tokens: 3208642560 | elapsed time per iteration (s): 4.15 | learning rate: 1.984E-04 | global 
batch size: 512 | lm loss: 2.463602E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 3070/ 44073 | consumed samples: 1571840 | consumed tokens: 3219128320 | elapsed time per iteration (s): 4.18 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.480939E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.576 | TFLOPs: 57.13 | 7: iteration 3080/ 44073 | consumed samples: 1576960 | consumed tokens: 3229614080 | elapsed time per iteration (s): 4.18 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.472904E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.386 | TFLOPs: 57.04 | 7: iteration 3090/ 44073 | consumed samples: 1582080 | consumed tokens: 3240099840 | elapsed time per iteration (s): 4.18 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.481435E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.534 | TFLOPs: 57.11 | 7: iteration 3100/ 44073 | consumed samples: 1587200 | consumed tokens: 3250585600 | elapsed time per iteration (s): 4.17 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 2.498338E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.822 | TFLOPs: 57.24 | 7: iteration 3110/ 44073 | consumed samples: 1592320 | consumed tokens: 3261071360 | elapsed time per iteration (s): 4.15 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.508925E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.449 | TFLOPs: 57.53 | 7: iteration 3120/ 44073 | consumed samples: 1597440 | consumed tokens: 
3271557120 | elapsed time per iteration (s): 4.15 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.510841E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 3130/ 44073 | consumed samples: 1602560 | consumed tokens: 3282042880 | elapsed time per iteration (s): 4.16 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.490585E+00 | grad norm: 0.302 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.990 | TFLOPs: 57.32 | 7: iteration 3140/ 44073 | consumed samples: 1607680 | consumed tokens: 3292528640 | elapsed time per iteration (s): 4.19 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.478363E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.307 | TFLOPs: 57.00 | 7: iteration 3150/ 44073 | consumed samples: 1612800 | consumed tokens: 3303014400 | elapsed time per iteration (s): 4.17 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.470381E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.883 | TFLOPs: 57.27 | 7: iteration 3160/ 44073 | consumed samples: 1617920 | consumed tokens: 3313500160 | elapsed time per iteration (s): 4.16 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.469394E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.965 | TFLOPs: 57.31 | 7: iteration 3170/ 44073 | consumed samples: 1623040 | consumed tokens: 3323985920 | elapsed time per iteration (s): 4.18 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.473151E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.398 | TFLOPs: 
57.04 | 7: iteration 3180/ 44073 | consumed samples: 1628160 | consumed tokens: 3334471680 | elapsed time per iteration (s): 4.24 | learning rate: 1.983E-04 | global batch size: 512 | lm loss: 2.511717E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.688 | TFLOPs: 56.25 | 7: iteration 3190/ 44073 | consumed samples: 1633280 | consumed tokens: 3344957440 | elapsed time per iteration (s): 4.24 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 3.005795E+00 | grad norm: 2.060 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.650 | TFLOPs: 56.23 | 7: iteration 3200/ 44073 | consumed samples: 1638400 | consumed tokens: 3355443200 | elapsed time per iteration (s): 4.18 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 3.048395E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.633 | TFLOPs: 57.15 | 7: iteration 3210/ 44073 | consumed samples: 1643520 | consumed tokens: 3365928960 | elapsed time per iteration (s): 4.16 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.974366E+00 | grad norm: 1.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.934 | TFLOPs: 57.29 | 7: iteration 3220/ 44073 | consumed samples: 1648640 | consumed tokens: 3376414720 | elapsed time per iteration (s): 4.17 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.968522E+00 | grad norm: 0.856 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.867 | TFLOPs: 57.26 | 7: iteration 3230/ 44073 | consumed samples: 1653760 | consumed tokens: 3386900480 | elapsed time per iteration (s): 4.21 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.918617E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.686 | TFLOPs: 56.71 | 7: iteration 3240/ 44073 | consumed samples: 1658880 | consumed tokens: 3397386240 | elapsed time per iteration (s): 4.19 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.887879E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.332 | TFLOPs: 57.01 | 7: iteration 3250/ 44073 | consumed samples: 1664000 | consumed tokens: 3407872000 | elapsed time per iteration (s): 4.19 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.740550E+00 | grad norm: 0.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.222 | TFLOPs: 56.96 | 7: iteration 3260/ 44073 | consumed samples: 1669120 | consumed tokens: 3418357760 | elapsed time per iteration (s): 4.17 | learning rate: 1.982E-04 | global batch size: 512 | lm loss: 2.657637E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.678 | TFLOPs: 57.17 | 7: iteration 3270/ 44073 | consumed samples: 1674240 | consumed tokens: 3428843520 | elapsed time per iteration (s): 4.18 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.576731E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.392 | TFLOPs: 57.04 | 7: iteration 3280/ 44073 | consumed samples: 1679360 | consumed tokens: 3439329280 | elapsed time per iteration (s): 4.27 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.573126E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.923 | TFLOPs: 55.89 | 7: iteration 3290/ 44073 | consumed samples: 1684480 | consumed tokens: 3449815040 | elapsed time per iteration (s): 4.20 | learning rate: 1.981E-04 | global batch size: 512 | 
lm loss: 2.566412E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.874 | TFLOPs: 56.80 | 7: iteration 3300/ 44073 | consumed samples: 1689600 | consumed tokens: 3460300800 | elapsed time per iteration (s): 4.21 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.544158E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.706 | TFLOPs: 56.72 | 7: iteration 3310/ 44073 | consumed samples: 1694720 | consumed tokens: 3470786560 | elapsed time per iteration (s): 4.17 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.525013E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.782 | TFLOPs: 57.22 | 7: iteration 3320/ 44073 | consumed samples: 1699840 | consumed tokens: 3481272320 | elapsed time per iteration (s): 4.17 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.523210E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.735 | TFLOPs: 57.20 | 7: iteration 3330/ 44073 | consumed samples: 1704960 | consumed tokens: 3491758080 | elapsed time per iteration (s): 4.16 | learning rate: 1.981E-04 | global batch size: 512 | lm loss: 2.512925E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.120 | TFLOPs: 57.38 | 7: iteration 3340/ 44073 | consumed samples: 1710080 | consumed tokens: 3502243840 | elapsed time per iteration (s): 4.18 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.496305E+00 | grad norm: 0.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.372 | TFLOPs: 57.03 | 7: iteration 3350/ 44073 | consumed samples: 1715200 | consumed tokens: 3512729600 | elapsed time 
per iteration (s): 4.19 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.514453E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.251 | TFLOPs: 56.98 | 7: iteration 3360/ 44073 | consumed samples: 1720320 | consumed tokens: 3523215360 | elapsed time per iteration (s): 4.17 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.501866E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.717 | TFLOPs: 57.19 | 7: iteration 3370/ 44073 | consumed samples: 1725440 | consumed tokens: 3533701120 | elapsed time per iteration (s): 4.18 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.509407E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.529 | TFLOPs: 57.10 | 7: iteration 3380/ 44073 | consumed samples: 1730560 | consumed tokens: 3544186880 | elapsed time per iteration (s): 4.21 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.481453E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.579 | TFLOPs: 56.66 | 7: iteration 3390/ 44073 | consumed samples: 1735680 | consumed tokens: 3554672640 | elapsed time per iteration (s): 4.22 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.458088E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.197 | TFLOPs: 56.48 | 7: iteration 3400/ 44073 | consumed samples: 1740800 | consumed tokens: 3565158400 | elapsed time per iteration (s): 4.22 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.509101E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.390 | TFLOPs: 56.57 | 7: iteration 3410/ 
44073 | consumed samples: 1745920 | consumed tokens: 3575644160 | elapsed time per iteration (s): 4.18 | learning rate: 1.980E-04 | global batch size: 512 | lm loss: 2.474905E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.470 | TFLOPs: 57.08 | 7: iteration 3420/ 44073 | consumed samples: 1751040 | consumed tokens: 3586129920 | elapsed time per iteration (s): 4.16 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.488220E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.931 | TFLOPs: 57.29 | 7: iteration 3430/ 44073 | consumed samples: 1756160 | consumed tokens: 3596615680 | elapsed time per iteration (s): 4.17 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.468357E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.784 | TFLOPs: 57.22 | 7: iteration 3440/ 44073 | consumed samples: 1761280 | consumed tokens: 3607101440 | elapsed time per iteration (s): 4.18 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.501355E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.508 | TFLOPs: 57.09 | 7: iteration 3450/ 44073 | consumed samples: 1766400 | consumed tokens: 3617587200 | elapsed time per iteration (s): 4.19 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.432758E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.118 | TFLOPs: 56.91 | 7: iteration 3460/ 44073 | consumed samples: 1771520 | consumed tokens: 3628072960 | elapsed time per iteration (s): 4.20 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.467693E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.049 | TFLOPs: 56.88 | 7: iteration 3470/ 44073 | consumed samples: 1776640 | consumed tokens: 3638558720 | elapsed time per iteration (s): 4.21 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.468913E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.752 | TFLOPs: 56.74 | 7: iteration 3480/ 44073 | consumed samples: 1781760 | consumed tokens: 3649044480 | elapsed time per iteration (s): 4.18 | learning rate: 1.979E-04 | global batch size: 512 | lm loss: 2.474522E+00 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.378 | TFLOPs: 57.03 | 7: iteration 3490/ 44073 | consumed samples: 1786880 | consumed tokens: 3659530240 | elapsed time per iteration (s): 4.22 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.444181E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.346 | TFLOPs: 56.55 | 7: iteration 3500/ 44073 | consumed samples: 1792000 | consumed tokens: 3670016000 | elapsed time per iteration (s): 4.14 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.480674E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.527 | TFLOPs: 57.57 | 7: iteration 3510/ 44073 | consumed samples: 1797120 | consumed tokens: 3680501760 | elapsed time per iteration (s): 4.15 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.471727E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.435 | TFLOPs: 57.53 | 7: iteration 3520/ 44073 | consumed samples: 1802240 | consumed tokens: 3690987520 | elapsed time per iteration (s): 4.20 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.455891E+00 | grad 
norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.039 | TFLOPs: 56.88 | 7: iteration 3530/ 44073 | consumed samples: 1807360 | consumed tokens: 3701473280 | elapsed time per iteration (s): 4.20 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.459863E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.911 | TFLOPs: 56.82 | 7: iteration 3540/ 44073 | consumed samples: 1812480 | consumed tokens: 3711959040 | elapsed time per iteration (s): 4.20 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.451777E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.875 | TFLOPs: 56.80 | 7: iteration 3550/ 44073 | consumed samples: 1817600 | consumed tokens: 3722444800 | elapsed time per iteration (s): 4.17 | learning rate: 1.978E-04 | global batch size: 512 | lm loss: 2.446255E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.899 | TFLOPs: 57.28 | 7: iteration 3560/ 44073 | consumed samples: 1822720 | consumed tokens: 3732930560 | elapsed time per iteration (s): 4.20 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.467263E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.768 | TFLOPs: 56.75 | 7: iteration 3570/ 44073 | consumed samples: 1827840 | consumed tokens: 3743416320 | elapsed time per iteration (s): 4.18 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.471759E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.446 | TFLOPs: 57.07 | 7: iteration 3580/ 44073 | consumed samples: 1832960 | consumed tokens: 3753902080 | elapsed time per iteration (s): 4.17 | 
learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.470513E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.764 | TFLOPs: 57.21 | 7: iteration 3590/ 44073 | consumed samples: 1838080 | consumed tokens: 3764387840 | elapsed time per iteration (s): 4.19 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.470849E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.084 | TFLOPs: 56.90 | 7: iteration 3600/ 44073 | consumed samples: 1843200 | consumed tokens: 3774873600 | elapsed time per iteration (s): 4.18 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.436213E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.623 | TFLOPs: 57.15 | 7: iteration 3610/ 44073 | consumed samples: 1848320 | consumed tokens: 3785359360 | elapsed time per iteration (s): 4.19 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.449372E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.156 | TFLOPs: 56.93 | 7: iteration 3620/ 44073 | consumed samples: 1853440 | consumed tokens: 3795845120 | elapsed time per iteration (s): 4.18 | learning rate: 1.977E-04 | global batch size: 512 | lm loss: 2.461925E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.607 | TFLOPs: 57.14 | 7: iteration 3630/ 44073 | consumed samples: 1858560 | consumed tokens: 3806330880 | elapsed time per iteration (s): 4.15 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.452827E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 | 7: iteration 3640/ 44073 | consumed samples: 
1863680 | consumed tokens: 3816816640 | elapsed time per iteration (s): 4.43 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.467082E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 115.633 | TFLOPs: 53.89 | 7: iteration 3650/ 44073 | consumed samples: 1868800 | consumed tokens: 3827302400 | elapsed time per iteration (s): 4.19 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.460991E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.246 | TFLOPs: 56.97 | 7: iteration 3660/ 44073 | consumed samples: 1873920 | consumed tokens: 3837788160 | elapsed time per iteration (s): 4.15 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.445359E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.263 | TFLOPs: 57.45 | 7: iteration 3670/ 44073 | consumed samples: 1879040 | consumed tokens: 3848273920 | elapsed time per iteration (s): 4.46 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.436924E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.695 | TFLOPs: 53.45 | 7: iteration 3680/ 44073 | consumed samples: 1884160 | consumed tokens: 3858759680 | elapsed time per iteration (s): 4.16 | learning rate: 1.976E-04 | global batch size: 512 | lm loss: 2.448527E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.152 | TFLOPs: 57.40 | 7: iteration 3690/ 44073 | consumed samples: 1889280 | consumed tokens: 3869245440 | elapsed time per iteration (s): 4.17 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.440751E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.742 | TFLOPs: 57.20 | 7: iteration 3700/ 44073 | consumed samples: 1894400 | consumed tokens: 3879731200 | elapsed time per iteration (s): 4.22 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.431059E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.208 | TFLOPs: 56.49 | 7: iteration 3710/ 44073 | consumed samples: 1899520 | consumed tokens: 3890216960 | elapsed time per iteration (s): 4.18 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.436164E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.398 | TFLOPs: 57.04 | 7: iteration 3720/ 44073 | consumed samples: 1904640 | consumed tokens: 3900702720 | elapsed time per iteration (s): 4.27 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.417682E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.777 | TFLOPs: 55.82 | 7: iteration 3730/ 44073 | consumed samples: 1909760 | consumed tokens: 3911188480 | elapsed time per iteration (s): 4.23 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.442210E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.035 | TFLOPs: 56.41 | 7: iteration 3740/ 44073 | consumed samples: 1914880 | consumed tokens: 3921674240 | elapsed time per iteration (s): 4.18 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.411998E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.500 | TFLOPs: 57.09 | 7: iteration 3750/ 44073 | consumed samples: 1920000 | consumed tokens: 3932160000 | elapsed time per iteration (s): 4.15 | learning rate: 1.975E-04 | global batch size: 512 | lm loss: 2.423022E+00 | grad norm: 0.160 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.419 | TFLOPs: 57.52 | 7: iteration 3760/ 44073 | consumed samples: 1925120 | consumed tokens: 3942645760 | elapsed time per iteration (s): 4.15 | learning rate: 1.974E-04 | global batch size: 512 | lm loss: 2.434138E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.244 | TFLOPs: 57.44 | 7: iteration 3770/ 44073 | consumed samples: 1930240 | consumed tokens: 3953131520 | elapsed time per iteration (s): 4.16 | learning rate: 1.974E-04 | global batch size: 512 | lm loss: 2.448205E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.214 | TFLOPs: 57.42 | 7: iteration 3780/ 44073 | consumed samples: 1935360 | consumed tokens: 3963617280 | elapsed time per iteration (s): 4.18 | learning rate: 1.974E-04 | global batch size: 512 | lm loss: 2.416473E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.581 | TFLOPs: 57.13 | 7: iteration 3790/ 44073 | consumed samples: 1940480 | consumed tokens: 3974103040 | elapsed time per iteration (s): 4.17 | learning rate: 1.974E-04 | global batch size: 512 | lm loss: 2.441776E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.715 | TFLOPs: 57.19 | 7: iteration 3800/ 44073 | consumed samples: 1945600 | consumed tokens: 3984588800 | elapsed time per iteration (s): 4.17 | learning rate: 1.974E-04 | global batch size: 512 | lm loss: 2.439358E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.800 | TFLOPs: 57.23 | 7: iteration 3810/ 44073 | consumed samples: 1950720 | consumed tokens: 3995074560 | elapsed time per iteration (s): 4.37 | learning rate: 1.974E-04 | global 
batch size: 512 | lm loss: 2.428748E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.101 | TFLOPs: 54.58 | 7: iteration 3820/ 44073 | consumed samples: 1955840 | consumed tokens: 4005560320 | elapsed time per iteration (s): 4.20 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.425951E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.778 | TFLOPs: 56.75 | 7: iteration 3830/ 44073 | consumed samples: 1960960 | consumed tokens: 4016046080 | elapsed time per iteration (s): 4.20 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.412665E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.919 | TFLOPs: 56.82 | 7: iteration 3840/ 44073 | consumed samples: 1966080 | consumed tokens: 4026531840 | elapsed time per iteration (s): 4.17 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.416212E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.731 | TFLOPs: 57.20 | 7: iteration 3850/ 44073 | consumed samples: 1971200 | consumed tokens: 4037017600 | elapsed time per iteration (s): 4.21 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.429298E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.725 | TFLOPs: 56.73 | 7: iteration 3860/ 44073 | consumed samples: 1976320 | consumed tokens: 4047503360 | elapsed time per iteration (s): 4.17 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.449447E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.905 | TFLOPs: 57.28 | 7: iteration 3870/ 44073 | consumed samples: 1981440 | consumed tokens: 
4057989120 | elapsed time per iteration (s): 4.17 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.473199E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.857 | TFLOPs: 57.26 | 7: iteration 3880/ 44073 | consumed samples: 1986560 | consumed tokens: 4068474880 | elapsed time per iteration (s): 4.16 | learning rate: 1.973E-04 | global batch size: 512 | lm loss: 2.446705E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.175 | TFLOPs: 57.41 | 7: iteration 3890/ 44073 | consumed samples: 1991680 | consumed tokens: 4078960640 | elapsed time per iteration (s): 4.16 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.449501E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.207 | TFLOPs: 57.42 | 7: iteration 3900/ 44073 | consumed samples: 1996800 | consumed tokens: 4089446400 | elapsed time per iteration (s): 4.18 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.417696E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.544 | TFLOPs: 57.11 | 7: iteration 3910/ 44073 | consumed samples: 2001920 | consumed tokens: 4099932160 | elapsed time per iteration (s): 4.18 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.429999E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.418 | TFLOPs: 57.05 | 7: iteration 3920/ 44073 | consumed samples: 2007040 | consumed tokens: 4110417920 | elapsed time per iteration (s): 4.16 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.405777E+00 | grad norm: 0.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.074 | TFLOPs: 
57.36 | 7: iteration 3930/ 44073 | consumed samples: 2012160 | consumed tokens: 4120903680 | elapsed time per iteration (s): 4.17 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.425753E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.687 | TFLOPs: 57.18 | 7: iteration 3940/ 44073 | consumed samples: 2017280 | consumed tokens: 4131389440 | elapsed time per iteration (s): 4.31 | learning rate: 1.972E-04 | global batch size: 512 | lm loss: 2.420885E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.749 | TFLOPs: 55.34 | 7: iteration 3950/ 44073 | consumed samples: 2022400 | consumed tokens: 4141875200 | elapsed time per iteration (s): 4.18 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.402656E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.510 | TFLOPs: 57.10 | 7: iteration 3960/ 44073 | consumed samples: 2027520 | consumed tokens: 4152360960 | elapsed time per iteration (s): 4.20 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.417758E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.995 | TFLOPs: 56.86 | 7: iteration 3970/ 44073 | consumed samples: 2032640 | consumed tokens: 4162846720 | elapsed time per iteration (s): 4.17 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.413486E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.751 | TFLOPs: 57.21 | 7: iteration 3980/ 44073 | consumed samples: 2037760 | consumed tokens: 4173332480 | elapsed time per iteration (s): 4.16 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.422588E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.113 | TFLOPs: 57.38 | 7: iteration 3990/ 44073 | consumed samples: 2042880 | consumed tokens: 4183818240 | elapsed time per iteration (s): 4.17 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.415223E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.917 | TFLOPs: 57.29 | 0: [2022-11-25 14:37:29,078] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=0, lr=[0.00019706081683852067, 0.00019706081683852067, 0.00019706081683852067], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 4000/ 44073 | consumed samples: 2048000 | consumed tokens: 4194304000 | elapsed time per iteration (s): 4.17 | learning rate: 1.971E-04 | global batch size: 512 | lm loss: 2.426867E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.811 | TFLOPs: 57.24 | 0: steps: 4000 loss: 2.3977 iter time (s): 4.185 samples/sec: 122.353 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 4000 | lm loss value: 2.389601E+00 | lm loss PPL: 1.090914E+01 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 4000 to checkpoints_2b2 0: [2022-11-25 14:37:30,448] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save! 0: [2022-11-25 14:37:30,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_01-model_00-model_states.pt... 0: [2022-11-25 14:37:31,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_01-model_00-model_states.pt. 0: [2022-11-25 14:37:31,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 14:37:31,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_03-model_00-model_states.pt. 0: [2022-11-25 14:37:31,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_04-model_00-model_states.pt... 0: [2022-11-25 14:37:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_04-model_00-model_states.pt. 0: [2022-11-25 14:37:31,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_05-model_00-model_states.pt... 0: [2022-11-25 14:37:31,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_05-model_00-model_states.pt. 0: [2022-11-25 14:37:31,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_06-model_00-model_states.pt... 0: [2022-11-25 14:37:31,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_06-model_00-model_states.pt. 0: [2022-11-25 14:37:31,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_07-model_00-model_states.pt... 0: [2022-11-25 14:37:31,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_07-model_00-model_states.pt. 0: [2022-11-25 14:37:31,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_08-model_00-model_states.pt... 0: [2022-11-25 14:37:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_08-model_00-model_states.pt. 0: [2022-11-25 14:37:31,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 14:37:32,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_09-model_00-model_states.pt. 0: [2022-11-25 14:37:32,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_10-model_00-model_states.pt... 0: [2022-11-25 14:37:32,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_10-model_00-model_states.pt. 0: [2022-11-25 14:37:32,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_11-model_00-model_states.pt... 0: [2022-11-25 14:37:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_11-model_00-model_states.pt. 0: [2022-11-25 14:37:32,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_12-model_00-model_states.pt... 0: [2022-11-25 14:37:32,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_12-model_00-model_states.pt. 0: [2022-11-25 14:37:32,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_13-model_00-model_states.pt... 0: [2022-11-25 14:37:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_13-model_00-model_states.pt. 0: [2022-11-25 14:37:32,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_14-model_00-model_states.pt... 0: [2022-11-25 14:37:32,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_14-model_00-model_states.pt. 0: [2022-11-25 14:37:32,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 14:37:32,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_15-model_00-model_states.pt. 0: [2022-11-25 14:37:32,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_16-model_00-model_states.pt... 0: [2022-11-25 14:37:33,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_16-model_00-model_states.pt. 0: [2022-11-25 14:37:33,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_17-model_00-model_states.pt... 0: [2022-11-25 14:37:33,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_17-model_00-model_states.pt. 0: [2022-11-25 14:37:33,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_18-model_00-model_states.pt... 0: [2022-11-25 14:37:33,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_18-model_00-model_states.pt. 0: [2022-11-25 14:37:33,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_19-model_00-model_states.pt... 0: [2022-11-25 14:37:33,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_19-model_00-model_states.pt. 0: [2022-11-25 14:37:33,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_20-model_00-model_states.pt... 0: [2022-11-25 14:37:33,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_20-model_00-model_states.pt. 0: [2022-11-25 14:37:33,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 14:37:33,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_21-model_00-model_states.pt. 0: [2022-11-25 14:37:33,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_22-model_00-model_states.pt... 0: [2022-11-25 14:37:33,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_22-model_00-model_states.pt. 0: [2022-11-25 14:37:33,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_23-model_00-model_states.pt... 0: [2022-11-25 14:37:33,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_23-model_00-model_states.pt. 0: [2022-11-25 14:37:33,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_24-model_00-model_states.pt... 0: [2022-11-25 14:37:34,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_24-model_00-model_states.pt. 0: [2022-11-25 14:37:34,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_25-model_00-model_states.pt... 0: [2022-11-25 14:37:34,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_25-model_00-model_states.pt. 0: [2022-11-25 14:37:34,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_26-model_00-model_states.pt... 0: [2022-11-25 14:37:34,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_26-model_00-model_states.pt. 0: [2022-11-25 14:37:34,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 14:37:34,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_27-model_00-model_states.pt. 0: [2022-11-25 14:37:34,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_28-model_00-model_states.pt... 0: [2022-11-25 14:37:34,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_28-model_00-model_states.pt. 0: [2022-11-25 14:37:34,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_29-model_00-model_states.pt... 0: [2022-11-25 14:37:34,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_29-model_00-model_states.pt. 0: [2022-11-25 14:37:34,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_30-model_00-model_states.pt... 0: [2022-11-25 14:37:34,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_30-model_00-model_states.pt. 0: [2022-11-25 14:37:34,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_31-model_00-model_states.pt... 0: [2022-11-25 14:37:35,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_31-model_00-model_states.pt. 0: [2022-11-25 14:37:35,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_32-model_00-model_states.pt... 0: [2022-11-25 14:37:35,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_32-model_00-model_states.pt. 0: [2022-11-25 14:37:35,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_33-model_00-model_states.pt... 
0: [2022-11-25 14:37:35,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_33-model_00-model_states.pt. 0: [2022-11-25 14:37:35,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_34-model_00-model_states.pt... 0: [2022-11-25 14:37:35,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_34-model_00-model_states.pt. 0: [2022-11-25 14:37:35,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/layer_36-model_00-model_states.pt... 0: [2022-11-25 14:37:35,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/layer_36-model_00-model_states.pt. 0: [2022-11-25 14:37:35,498] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step4000/mp_rank_00_model_states.pt 0: [2022-11-25 14:37:35,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/mp_rank_00_model_states.pt... 0: [2022-11-25 14:37:35,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/mp_rank_00_model_states.pt. 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 3: [2022-11-25 14:37:35,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 0: [2022-11-25 14:37:36,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 14:37:36,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
0: [2022-11-25 14:37:36,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 14:37:36,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 14:37:36,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 14:37:36,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
0: [2022-11-25 14:37:36,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 14:37:36,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
3: [2022-11-25 14:37:36,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 14:37:36,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 14:37:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 14:37:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 14:37:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 14:37:36,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 14:37:36,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
6: [2022-11-25 14:37:36,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 14:37:36,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 14:37:36,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 14:37:36,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 14:37:36,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 14:37:36,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 14:37:36,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 14:37:36,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 14:37:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
5: [2022-11-25 14:37:36,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt.
5: [2022-11-25 14:37:36,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt
5: [2022-11-25 14:37:36,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
0: successfully saved checkpoint at iteration 4000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6504.51
7: iteration 4010/ 44073 | consumed samples: 2053120 | consumed tokens: 4204789760 | elapsed time per iteration (s): 4.97 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.418774E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.064 | TFLOPs: 48.03 |
7: iteration 4020/ 44073 | consumed samples: 2058240 | consumed tokens: 4215275520 | elapsed time per iteration (s): 4.16 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.399221E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.035 | TFLOPs: 57.34 |
7: iteration 4030/ 44073 | consumed samples: 2063360 | consumed tokens: 4225761280 | elapsed time per iteration (s): 4.19 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.413215E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.117 | TFLOPs: 56.91 |
7: iteration 4040/ 44073 | consumed samples: 2068480 | consumed tokens: 4236247040 | elapsed time per iteration (s): 4.16 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.393093E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.989 | TFLOPs: 57.32 |
7: iteration 4050/ 44073 | consumed samples: 2073600 | consumed tokens: 4246732800 | elapsed time per iteration (s): 4.20 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.397388E+00 | 
grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.961 | TFLOPs: 56.84 |
7: iteration 4060/ 44073 | consumed samples: 2078720 | consumed tokens: 4257218560 | elapsed time per iteration (s): 4.15 | learning rate: 1.970E-04 | global batch size: 512 | lm loss: 2.395676E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 |
7: iteration 4070/ 44073 | consumed samples: 2083840 | consumed tokens: 4267704320 | elapsed time per iteration (s): 4.18 | learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.404736E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.442 | TFLOPs: 57.06 |
7: iteration 4080/ 44073 | consumed samples: 2088960 | consumed tokens: 4278190080 | elapsed time per iteration (s): 4.14 | learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.408962E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.592 | TFLOPs: 57.60 |
7: iteration 4090/ 44073 | consumed samples: 2094080 | consumed tokens: 4288675840 | elapsed time per iteration (s): 4.18 | learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.397797E+00 | grad norm: 0.364 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.478 | TFLOPs: 57.08 |
7: iteration 4100/ 44073 | consumed samples: 2099200 | consumed tokens: 4299161600 | elapsed time per iteration (s): 4.14 | learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.442252E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.588 | TFLOPs: 57.60 |
7: iteration 4110/ 44073 | consumed samples: 2104320 | consumed tokens: 4309647360 | elapsed time per iteration (s): 4.16 | 
learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.409397E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.999 | TFLOPs: 57.32 |
7: iteration 4120/ 44073 | consumed samples: 2109440 | consumed tokens: 4320133120 | elapsed time per iteration (s): 4.18 | learning rate: 1.969E-04 | global batch size: 512 | lm loss: 2.385623E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.412 | TFLOPs: 57.05 |
7: iteration 4130/ 44073 | consumed samples: 2114560 | consumed tokens: 4330618880 | elapsed time per iteration (s): 4.21 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.417806E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.722 | TFLOPs: 56.73 |
7: iteration 4140/ 44073 | consumed samples: 2119680 | consumed tokens: 4341104640 | elapsed time per iteration (s): 4.17 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.397386E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.653 | TFLOPs: 57.16 |
7: iteration 4150/ 44073 | consumed samples: 2124800 | consumed tokens: 4351590400 | elapsed time per iteration (s): 4.18 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.428199E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.393 | TFLOPs: 57.04 |
7: iteration 4160/ 44073 | consumed samples: 2129920 | consumed tokens: 4362076160 | elapsed time per iteration (s): 4.19 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.412618E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.193 | TFLOPs: 56.95 |
7: iteration 4170/ 44073 | consumed samples: 
2135040 | consumed tokens: 4372561920 | elapsed time per iteration (s): 4.18 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.426413E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.586 | TFLOPs: 57.13 |
7: iteration 4180/ 44073 | consumed samples: 2140160 | consumed tokens: 4383047680 | elapsed time per iteration (s): 4.16 | learning rate: 1.968E-04 | global batch size: 512 | lm loss: 2.406737E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.198 | TFLOPs: 57.42 |
7: iteration 4190/ 44073 | consumed samples: 2145280 | consumed tokens: 4393533440 | elapsed time per iteration (s): 4.17 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.381380E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.686 | TFLOPs: 57.18 |
7: iteration 4200/ 44073 | consumed samples: 2150400 | consumed tokens: 4404019200 | elapsed time per iteration (s): 4.18 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.379512E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.518 | TFLOPs: 57.10 |
7: iteration 4210/ 44073 | consumed samples: 2155520 | consumed tokens: 4414504960 | elapsed time per iteration (s): 4.15 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.365398E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.491 | TFLOPs: 57.55 |
7: iteration 4220/ 44073 | consumed samples: 2160640 | consumed tokens: 4424990720 | elapsed time per iteration (s): 4.14 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.397319E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.615 | TFLOPs: 57.61 |
7: iteration 4230/ 44073 | consumed samples: 2165760 | consumed tokens: 4435476480 | elapsed time per iteration (s): 4.14 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.407310E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.656 | TFLOPs: 57.63 |
7: iteration 4240/ 44073 | consumed samples: 2170880 | consumed tokens: 4445962240 | elapsed time per iteration (s): 4.16 | learning rate: 1.967E-04 | global batch size: 512 | lm loss: 2.378183E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.104 | TFLOPs: 57.37 |
7: iteration 4250/ 44073 | consumed samples: 2176000 | consumed tokens: 4456448000 | elapsed time per iteration (s): 4.17 | learning rate: 1.966E-04 | global batch size: 512 | lm loss: 2.405255E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.912 | TFLOPs: 57.28 |
7: iteration 4260/ 44073 | consumed samples: 2181120 | consumed tokens: 4466933760 | elapsed time per iteration (s): 4.15 | learning rate: 1.966E-04 | global batch size: 512 | lm loss: 2.356765E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.356 | TFLOPs: 57.49 |
7: iteration 4270/ 44073 | consumed samples: 2186240 | consumed tokens: 4477419520 | elapsed time per iteration (s): 4.15 | learning rate: 1.966E-04 | global batch size: 512 | lm loss: 2.376925E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.404 | TFLOPs: 57.51 |
7: iteration 4280/ 44073 | consumed samples: 2191360 | consumed tokens: 4487905280 | elapsed time per iteration (s): 4.14 | learning rate: 1.966E-04 | global batch size: 512 | lm loss: 2.384640E+00 | grad norm: 0.154 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.539 | TFLOPs: 57.58 |
7: iteration 4290/ 44073 | consumed samples: 2196480 | consumed tokens: 4498391040 | elapsed time per iteration (s): 4.16 | learning rate: 1.966E-04 | global batch size: 512 | lm loss: 2.385738E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.955 | TFLOPs: 57.30 |
7: iteration 4300/ 44073 | consumed samples: 2201600 | consumed tokens: 4508876800 | elapsed time per iteration (s): 4.16 | learning rate: 1.965E-04 | global batch size: 512 | lm loss: 2.408186E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.162 | TFLOPs: 57.40 |
7: iteration 4310/ 44073 | consumed samples: 2206720 | consumed tokens: 4519362560 | elapsed time per iteration (s): 4.16 | learning rate: 1.965E-04 | global batch size: 512 | lm loss: 2.394680E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.160 | TFLOPs: 57.40 |
7: iteration 4320/ 44073 | consumed samples: 2211840 | consumed tokens: 4529848320 | elapsed time per iteration (s): 4.16 | learning rate: 1.965E-04 | global batch size: 512 | lm loss: 2.364961E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.215 | TFLOPs: 57.42 |
7: iteration 4330/ 44073 | consumed samples: 2216960 | consumed tokens: 4540334080 | elapsed time per iteration (s): 4.15 | learning rate: 1.965E-04 | global batch size: 512 | lm loss: 2.369899E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.435 | TFLOPs: 57.53 |
7: iteration 4340/ 44073 | consumed samples: 2222080 | consumed tokens: 4550819840 | elapsed time per iteration (s): 4.17 | learning rate: 1.965E-04 | global 
batch size: 512 | lm loss: 2.376595E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.831 | TFLOPs: 57.25 |
7: iteration 4350/ 44073 | consumed samples: 2227200 | consumed tokens: 4561305600 | elapsed time per iteration (s): 4.14 | learning rate: 1.965E-04 | global batch size: 512 | lm loss: 2.369461E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.663 | TFLOPs: 57.63 |
7: iteration 4360/ 44073 | consumed samples: 2232320 | consumed tokens: 4571791360 | elapsed time per iteration (s): 4.31 | learning rate: 1.964E-04 | global batch size: 512 | lm loss: 2.359175E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.900 | TFLOPs: 55.41 |
7: iteration 4370/ 44073 | consumed samples: 2237440 | consumed tokens: 4582277120 | elapsed time per iteration (s): 4.15 | learning rate: 1.964E-04 | global batch size: 512 | lm loss: 2.374695E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.293 | TFLOPs: 57.46 |
7: iteration 4380/ 44073 | consumed samples: 2242560 | consumed tokens: 4592762880 | elapsed time per iteration (s): 4.14 | learning rate: 1.964E-04 | global batch size: 512 | lm loss: 2.383851E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.682 | TFLOPs: 57.64 |
7: iteration 4390/ 44073 | consumed samples: 2247680 | consumed tokens: 4603248640 | elapsed time per iteration (s): 4.19 | learning rate: 1.964E-04 | global batch size: 512 | lm loss: 2.358695E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.272 | TFLOPs: 56.98 |
7: iteration 4400/ 44073 | consumed samples: 2252800 | consumed tokens: 
4613734400 | elapsed time per iteration (s): 4.14 | learning rate: 1.964E-04 | global batch size: 512 | lm loss: 2.369423E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 57.61 |
7: iteration 4410/ 44073 | consumed samples: 2257920 | consumed tokens: 4624220160 | elapsed time per iteration (s): 4.14 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.370321E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.576 | TFLOPs: 57.59 |
7: iteration 4420/ 44073 | consumed samples: 2263040 | consumed tokens: 4634705920 | elapsed time per iteration (s): 4.15 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.383261E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.446 | TFLOPs: 57.53 |
7: iteration 4430/ 44073 | consumed samples: 2268160 | consumed tokens: 4645191680 | elapsed time per iteration (s): 4.16 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.372821E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.171 | TFLOPs: 57.40 |
7: iteration 4440/ 44073 | consumed samples: 2273280 | consumed tokens: 4655677440 | elapsed time per iteration (s): 4.19 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.361685E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.207 | TFLOPs: 56.95 |
7: iteration 4450/ 44073 | consumed samples: 2278400 | consumed tokens: 4666163200 | elapsed time per iteration (s): 4.15 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.362945E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.334 | TFLOPs: 
57.48 |
7: iteration 4460/ 44073 | consumed samples: 2283520 | consumed tokens: 4676648960 | elapsed time per iteration (s): 4.18 | learning rate: 1.963E-04 | global batch size: 512 | lm loss: 2.453702E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.622 | TFLOPs: 57.15 |
7: iteration 4470/ 44073 | consumed samples: 2288640 | consumed tokens: 4687134720 | elapsed time per iteration (s): 4.18 | learning rate: 1.962E-04 | global batch size: 512 | lm loss: 2.589665E+00 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.529 | TFLOPs: 57.10 |
7: iteration 4480/ 44073 | consumed samples: 2293760 | consumed tokens: 4697620480 | elapsed time per iteration (s): 4.19 | learning rate: 1.962E-04 | global batch size: 512 | lm loss: 2.741286E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.289 | TFLOPs: 56.99 |
7: iteration 4490/ 44073 | consumed samples: 2298880 | consumed tokens: 4708106240 | elapsed time per iteration (s): 4.16 | learning rate: 1.962E-04 | global batch size: 512 | lm loss: 2.715267E+00 | grad norm: 0.575 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.984 | TFLOPs: 57.32 |
7: iteration 4500/ 44073 | consumed samples: 2304000 | consumed tokens: 4718592000 | elapsed time per iteration (s): 4.15 | learning rate: 1.962E-04 | global batch size: 512 | lm loss: 2.570080E+00 | grad norm: 0.317 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.297 | TFLOPs: 57.46 |
7: iteration 4510/ 44073 | consumed samples: 2309120 | consumed tokens: 4729077760 | elapsed time per iteration (s): 4.16 | learning rate: 1.962E-04 | global batch size: 512 | lm loss: 2.492377E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.933 | TFLOPs: 57.29 |
7: iteration 4520/ 44073 | consumed samples: 2314240 | consumed tokens: 4739563520 | elapsed time per iteration (s): 4.15 | learning rate: 1.961E-04 | global batch size: 512 | lm loss: 2.462484E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 |
7: iteration 4530/ 44073 | consumed samples: 2319360 | consumed tokens: 4750049280 | elapsed time per iteration (s): 4.15 | learning rate: 1.961E-04 | global batch size: 512 | lm loss: 2.447688E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.406 | TFLOPs: 57.51 |
7: iteration 4540/ 44073 | consumed samples: 2324480 | consumed tokens: 4760535040 | elapsed time per iteration (s): 4.15 | learning rate: 1.961E-04 | global batch size: 512 | lm loss: 2.430104E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.511 | TFLOPs: 57.56 |
7: iteration 4550/ 44073 | consumed samples: 2329600 | consumed tokens: 4771020800 | elapsed time per iteration (s): 4.14 | learning rate: 1.961E-04 | global batch size: 512 | lm loss: 2.407114E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.677 | TFLOPs: 57.64 |
7: iteration 4560/ 44073 | consumed samples: 2334720 | consumed tokens: 4781506560 | elapsed time per iteration (s): 4.16 | learning rate: 1.961E-04 | global batch size: 512 | lm loss: 2.394292E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.992 | TFLOPs: 57.32 |
7: iteration 4570/ 44073 | consumed samples: 2339840 | consumed tokens: 4791992320 | elapsed time per iteration (s): 4.17 | learning rate: 1.961E-04 | global batch size: 512 | 
lm loss: 2.395064E+00 | grad norm: 0.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.746 | TFLOPs: 57.21 |
7: iteration 4580/ 44073 | consumed samples: 2344960 | consumed tokens: 4802478080 | elapsed time per iteration (s): 4.18 | learning rate: 1.960E-04 | global batch size: 512 | lm loss: 2.434696E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.533 | TFLOPs: 57.11 |
7: iteration 4590/ 44073 | consumed samples: 2350080 | consumed tokens: 4812963840 | elapsed time per iteration (s): 4.15 | learning rate: 1.960E-04 | global batch size: 512 | lm loss: 2.437819E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.252 | TFLOPs: 57.44 |
7: iteration 4600/ 44073 | consumed samples: 2355200 | consumed tokens: 4823449600 | elapsed time per iteration (s): 4.38 | learning rate: 1.960E-04 | global batch size: 512 | lm loss: 2.400467E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.991 | TFLOPs: 54.52 |
7: iteration 4610/ 44073 | consumed samples: 2360320 | consumed tokens: 4833935360 | elapsed time per iteration (s): 4.27 | learning rate: 1.960E-04 | global batch size: 512 | lm loss: 2.363060E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.886 | TFLOPs: 55.87 |
7: iteration 4620/ 44073 | consumed samples: 2365440 | consumed tokens: 4844421120 | elapsed time per iteration (s): 4.18 | learning rate: 1.960E-04 | global batch size: 512 | lm loss: 2.393676E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.391 | TFLOPs: 57.04 |
7: iteration 4630/ 44073 | consumed samples: 2370560 | consumed tokens: 4854906880 | elapsed time 
per iteration (s): 4.19 | learning rate: 1.959E-04 | global batch size: 512 | lm loss: 2.395271E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.227 | TFLOPs: 56.96 |
7: iteration 4640/ 44073 | consumed samples: 2375680 | consumed tokens: 4865392640 | elapsed time per iteration (s): 4.18 | learning rate: 1.959E-04 | global batch size: 512 | lm loss: 2.396628E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.612 | TFLOPs: 57.14 |
7: iteration 4650/ 44073 | consumed samples: 2380800 | consumed tokens: 4875878400 | elapsed time per iteration (s): 4.20 | learning rate: 1.959E-04 | global batch size: 512 | lm loss: 2.372289E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.950 | TFLOPs: 56.83 |
7: iteration 4660/ 44073 | consumed samples: 2385920 | consumed tokens: 4886364160 | elapsed time per iteration (s): 4.27 | learning rate: 1.959E-04 | global batch size: 512 | lm loss: 2.379276E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.009 | TFLOPs: 55.93 |
7: iteration 4670/ 44073 | consumed samples: 2391040 | consumed tokens: 4896849920 | elapsed time per iteration (s): 4.18 | learning rate: 1.959E-04 | global batch size: 512 | lm loss: 2.396158E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.383 | TFLOPs: 57.04 |
7: iteration 4680/ 44073 | consumed samples: 2396160 | consumed tokens: 4907335680 | elapsed time per iteration (s): 4.17 | learning rate: 1.958E-04 | global batch size: 512 | lm loss: 2.366377E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.841 | TFLOPs: 57.25 |
7: iteration 4690/ 
44073 | consumed samples: 2401280 | consumed tokens: 4917821440 | elapsed time per iteration (s): 4.15 | learning rate: 1.958E-04 | global batch size: 512 | lm loss: 2.375724E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.287 | TFLOPs: 57.46 |
7: iteration 4700/ 44073 | consumed samples: 2406400 | consumed tokens: 4928307200 | elapsed time per iteration (s): 4.14 | learning rate: 1.958E-04 | global batch size: 512 | lm loss: 2.365001E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.608 | TFLOPs: 57.61 |
7: iteration 4710/ 44073 | consumed samples: 2411520 | consumed tokens: 4938792960 | elapsed time per iteration (s): 4.19 | learning rate: 1.958E-04 | global batch size: 512 | lm loss: 2.361508E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.235 | TFLOPs: 56.97 |
7: iteration 4720/ 44073 | consumed samples: 2416640 | consumed tokens: 4949278720 | elapsed time per iteration (s): 4.20 | learning rate: 1.958E-04 | global batch size: 512 | lm loss: 2.369111E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.018 | TFLOPs: 56.87 |
7: iteration 4730/ 44073 | consumed samples: 2421760 | consumed tokens: 4959764480 | elapsed time per iteration (s): 4.20 | learning rate: 1.957E-04 | global batch size: 512 | lm loss: 2.398822E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.039 | TFLOPs: 56.88 |
7: iteration 4740/ 44073 | consumed samples: 2426880 | consumed tokens: 4970250240 | elapsed time per iteration (s): 4.18 | learning rate: 1.957E-04 | global batch size: 512 | lm loss: 2.364708E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.455 | TFLOPs: 57.07 |
7: iteration 4750/ 44073 | consumed samples: 2432000 | consumed tokens: 4980736000 | elapsed time per iteration (s): 4.30 | learning rate: 1.957E-04 | global batch size: 512 | lm loss: 2.358136E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.205 | TFLOPs: 55.56 |
7: iteration 4760/ 44073 | consumed samples: 2437120 | consumed tokens: 4991221760 | elapsed time per iteration (s): 4.16 | learning rate: 1.957E-04 | global batch size: 512 | lm loss: 2.358661E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.090 | TFLOPs: 57.37 |
7: iteration 4770/ 44073 | consumed samples: 2442240 | consumed tokens: 5001707520 | elapsed time per iteration (s): 4.17 | learning rate: 1.957E-04 | global batch size: 512 | lm loss: 2.385531E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.925 | TFLOPs: 57.29 |
7: iteration 4780/ 44073 | consumed samples: 2447360 | consumed tokens: 5012193280 | elapsed time per iteration (s): 4.15 | learning rate: 1.956E-04 | global batch size: 512 | lm loss: 2.373404E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.279 | TFLOPs: 57.45 |
7: iteration 4790/ 44073 | consumed samples: 2452480 | consumed tokens: 5022679040 | elapsed time per iteration (s): 4.17 | learning rate: 1.956E-04 | global batch size: 512 | lm loss: 2.337454E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.879 | TFLOPs: 57.27 |
7: iteration 4800/ 44073 | consumed samples: 2457600 | consumed tokens: 5033164800 | elapsed time per iteration (s): 4.18 | learning rate: 1.956E-04 | global batch size: 512 | lm loss: 2.342335E+00 | grad 
norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.477 | TFLOPs: 57.08 |
7: iteration 4810/ 44073 | consumed samples: 2462720 | consumed tokens: 5043650560 | elapsed time per iteration (s): 4.15 | learning rate: 1.956E-04 | global batch size: 512 | lm loss: 2.371906E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.400 | TFLOPs: 57.51 |
7: iteration 4820/ 44073 | consumed samples: 2467840 | consumed tokens: 5054136320 | elapsed time per iteration (s): 4.15 | learning rate: 1.956E-04 | global batch size: 512 | lm loss: 2.352529E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.413 | TFLOPs: 57.52 |
7: iteration 4830/ 44073 | consumed samples: 2472960 | consumed tokens: 5064622080 | elapsed time per iteration (s): 4.20 | learning rate: 1.955E-04 | global batch size: 512 | lm loss: 2.369298E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.863 | TFLOPs: 56.79 |
7: iteration 4840/ 44073 | consumed samples: 2478080 | consumed tokens: 5075107840 | elapsed time per iteration (s): 4.14 | learning rate: 1.955E-04 | global batch size: 512 | lm loss: 2.341834E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.597 | TFLOPs: 57.60 |
7: iteration 4850/ 44073 | consumed samples: 2483200 | consumed tokens: 5085593600 | elapsed time per iteration (s): 4.16 | learning rate: 1.955E-04 | global batch size: 512 | lm loss: 2.349003E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.090 | TFLOPs: 57.37 |
7: iteration 4860/ 44073 | consumed samples: 2488320 | consumed tokens: 5096079360 | elapsed time per iteration (s): 4.16 | 
learning rate: 1.955E-04 | global batch size: 512 | lm loss: 2.349356E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.963 | TFLOPs: 57.31 | 7: iteration 4870/ 44073 | consumed samples: 2493440 | consumed tokens: 5106565120 | elapsed time per iteration (s): 4.18 | learning rate: 1.955E-04 | global batch size: 512 | lm loss: 2.383768E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.458 | TFLOPs: 57.07 | 7: iteration 4880/ 44073 | consumed samples: 2498560 | consumed tokens: 5117050880 | elapsed time per iteration (s): 4.16 | learning rate: 1.954E-04 | global batch size: 512 | lm loss: 2.343210E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.949 | TFLOPs: 57.30 | 7: iteration 4890/ 44073 | consumed samples: 2503680 | consumed tokens: 5127536640 | elapsed time per iteration (s): 4.15 | learning rate: 1.954E-04 | global batch size: 512 | lm loss: 2.367849E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.438 | TFLOPs: 57.53 | 7: iteration 4900/ 44073 | consumed samples: 2508800 | consumed tokens: 5138022400 | elapsed time per iteration (s): 4.16 | learning rate: 1.954E-04 | global batch size: 512 | lm loss: 2.352124E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.053 | TFLOPs: 57.35 | 7: iteration 4910/ 44073 | consumed samples: 2513920 | consumed tokens: 5148508160 | elapsed time per iteration (s): 4.17 | learning rate: 1.954E-04 | global batch size: 512 | lm loss: 2.347490E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.674 | TFLOPs: 57.17 | 7: iteration 4920/ 44073 | consumed samples: 
2519040 | consumed tokens: 5158993920 | elapsed time per iteration (s): 4.16 | learning rate: 1.954E-04 | global batch size: 512 | lm loss: 2.338112E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.043 | TFLOPs: 57.34 | 7: iteration 4930/ 44073 | consumed samples: 2524160 | consumed tokens: 5169479680 | elapsed time per iteration (s): 4.15 | learning rate: 1.953E-04 | global batch size: 512 | lm loss: 2.364086E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.452 | TFLOPs: 57.53 | 7: iteration 4940/ 44073 | consumed samples: 2529280 | consumed tokens: 5179965440 | elapsed time per iteration (s): 4.41 | learning rate: 1.953E-04 | global batch size: 512 | lm loss: 2.330400E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.067 | TFLOPs: 54.09 | 7: iteration 4950/ 44073 | consumed samples: 2534400 | consumed tokens: 5190451200 | elapsed time per iteration (s): 4.15 | learning rate: 1.953E-04 | global batch size: 512 | lm loss: 2.376352E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.331 | TFLOPs: 57.48 | 7: iteration 4960/ 44073 | consumed samples: 2539520 | consumed tokens: 5200936960 | elapsed time per iteration (s): 4.15 | learning rate: 1.953E-04 | global batch size: 512 | lm loss: 2.371745E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.253 | TFLOPs: 57.44 | 7: iteration 4970/ 44073 | consumed samples: 2544640 | consumed tokens: 5211422720 | elapsed time per iteration (s): 4.15 | learning rate: 1.953E-04 | global batch size: 512 | lm loss: 2.343571E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.271 | TFLOPs: 57.45 | 7: iteration 4980/ 44073 | consumed samples: 2549760 | consumed tokens: 5221908480 | elapsed time per iteration (s): 4.19 | learning rate: 1.952E-04 | global batch size: 512 | lm loss: 2.343429E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.248 | TFLOPs: 56.97 | 7: iteration 4990/ 44073 | consumed samples: 2554880 | consumed tokens: 5232394240 | elapsed time per iteration (s): 4.16 | learning rate: 1.952E-04 | global batch size: 512 | lm loss: 2.342192E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.080 | TFLOPs: 57.36 | 7: iteration 5000/ 44073 | consumed samples: 2560000 | consumed tokens: 5242880000 | elapsed time per iteration (s): 4.16 | learning rate: 1.952E-04 | global batch size: 512 | lm loss: 2.368849E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.115 | TFLOPs: 57.38 | 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 5000 | lm loss value: 2.261483E+00 | lm loss PPL: 9.597316E+00 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 5000 to checkpoints_2b2 0: [2022-11-25 15:47:12,474] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save! 0: [2022-11-25 15:47:12,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_01-model_00-model_states.pt... 0: [2022-11-25 15:47:12,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_01-model_00-model_states.pt. 
0: [2022-11-25 15:47:12,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_03-model_00-model_states.pt...
0: [2022-11-25 15:47:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_03-model_00-model_states.pt.
0: [2022-11-25 15:47:13,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_04-model_00-model_states.pt...
0: [2022-11-25 15:47:13,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_04-model_00-model_states.pt.
0: [2022-11-25 15:47:13,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_05-model_00-model_states.pt...
0: [2022-11-25 15:47:13,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_05-model_00-model_states.pt.
0: [2022-11-25 15:47:13,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_06-model_00-model_states.pt...
0: [2022-11-25 15:47:13,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_06-model_00-model_states.pt.
0: [2022-11-25 15:47:13,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_07-model_00-model_states.pt...
0: [2022-11-25 15:47:13,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_07-model_00-model_states.pt.
0: [2022-11-25 15:47:13,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_08-model_00-model_states.pt...
0: [2022-11-25 15:47:13,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_08-model_00-model_states.pt.
0: [2022-11-25 15:47:13,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_09-model_00-model_states.pt...
0: [2022-11-25 15:47:13,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_09-model_00-model_states.pt.
0: [2022-11-25 15:47:13,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_10-model_00-model_states.pt...
0: [2022-11-25 15:47:14,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_10-model_00-model_states.pt.
0: [2022-11-25 15:47:14,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_11-model_00-model_states.pt...
0: [2022-11-25 15:47:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_11-model_00-model_states.pt.
0: [2022-11-25 15:47:14,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_12-model_00-model_states.pt...
0: [2022-11-25 15:47:14,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_12-model_00-model_states.pt.
0: [2022-11-25 15:47:14,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_13-model_00-model_states.pt...
0: [2022-11-25 15:47:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_13-model_00-model_states.pt.
0: [2022-11-25 15:47:14,462] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_14-model_00-model_states.pt...
0: [2022-11-25 15:47:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_14-model_00-model_states.pt.
0: [2022-11-25 15:47:14,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_15-model_00-model_states.pt...
0: [2022-11-25 15:47:14,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_15-model_00-model_states.pt.
0: [2022-11-25 15:47:14,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_16-model_00-model_states.pt...
0: [2022-11-25 15:47:14,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_16-model_00-model_states.pt.
0: [2022-11-25 15:47:14,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_17-model_00-model_states.pt...
0: [2022-11-25 15:47:15,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_17-model_00-model_states.pt.
0: [2022-11-25 15:47:15,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_18-model_00-model_states.pt...
0: [2022-11-25 15:47:15,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_18-model_00-model_states.pt.
0: [2022-11-25 15:47:15,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_19-model_00-model_states.pt...
0: [2022-11-25 15:47:15,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_19-model_00-model_states.pt.
0: [2022-11-25 15:47:15,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_20-model_00-model_states.pt...
0: [2022-11-25 15:47:15,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_20-model_00-model_states.pt.
0: [2022-11-25 15:47:15,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_21-model_00-model_states.pt...
0: [2022-11-25 15:47:15,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_21-model_00-model_states.pt.
0: [2022-11-25 15:47:15,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_22-model_00-model_states.pt...
0: [2022-11-25 15:47:15,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_22-model_00-model_states.pt.
0: [2022-11-25 15:47:15,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_23-model_00-model_states.pt...
0: [2022-11-25 15:47:15,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_23-model_00-model_states.pt.
0: [2022-11-25 15:47:15,813] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_24-model_00-model_states.pt...
0: [2022-11-25 15:47:15,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_24-model_00-model_states.pt.
0: [2022-11-25 15:47:15,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_25-model_00-model_states.pt...
0: [2022-11-25 15:47:16,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_25-model_00-model_states.pt.
0: [2022-11-25 15:47:16,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_26-model_00-model_states.pt...
0: [2022-11-25 15:47:16,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_26-model_00-model_states.pt.
0: [2022-11-25 15:47:16,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_27-model_00-model_states.pt...
0: [2022-11-25 15:47:16,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_27-model_00-model_states.pt.
0: [2022-11-25 15:47:16,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_28-model_00-model_states.pt...
0: [2022-11-25 15:47:16,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_28-model_00-model_states.pt.
0: [2022-11-25 15:47:16,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_29-model_00-model_states.pt...
0: [2022-11-25 15:47:16,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_29-model_00-model_states.pt.
0: [2022-11-25 15:47:16,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_30-model_00-model_states.pt...
0: [2022-11-25 15:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_30-model_00-model_states.pt.
0: [2022-11-25 15:47:16,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_31-model_00-model_states.pt...
0: [2022-11-25 15:47:16,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_31-model_00-model_states.pt.
0: [2022-11-25 15:47:16,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_32-model_00-model_states.pt...
0: [2022-11-25 15:47:17,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_32-model_00-model_states.pt.
0: [2022-11-25 15:47:17,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_33-model_00-model_states.pt...
0: [2022-11-25 15:47:17,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_33-model_00-model_states.pt.
0: [2022-11-25 15:47:17,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_34-model_00-model_states.pt...
0: [2022-11-25 15:47:17,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_34-model_00-model_states.pt.
0: [2022-11-25 15:47:17,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/layer_36-model_00-model_states.pt...
0: [2022-11-25 15:47:17,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/layer_36-model_00-model_states.pt.
0: [2022-11-25 15:47:17,295] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step5000/mp_rank_00_model_states.pt
0: [2022-11-25 15:47:17,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/mp_rank_00_model_states.pt...
0: [2022-11-25 15:47:17,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/mp_rank_00_model_states.pt.
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt...
5: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt...
6: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt...
7: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt...
4: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt...
2: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt...
3: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt...
1: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
0: [2022-11-25 15:47:17,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: [2022-11-25 15:47:17,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: [2022-11-25 15:47:17,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
4: [2022-11-25 15:47:17,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-25 15:47:17,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-25 15:47:17,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt.
0: [2022-11-25 15:47:17,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
0: [2022-11-25 15:47:17,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
4: [2022-11-25 15:47:17,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
4: [2022-11-25 15:47:17,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
4: [2022-11-25 15:47:17,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
4: [2022-11-25 15:47:17,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-25 15:47:17,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-25 15:47:17,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
4: [2022-11-25 15:47:17,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-25 15:47:17,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 15:47:17,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 15:47:17,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 15:47:17,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 15:47:17,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 15:47:18,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 15:47:18,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 15:47:18,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 15:47:18,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 15:47:18,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 15:47:18,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 15:47:18,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 15:47:18,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 15:47:18,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 15:47:18,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 15:47:18,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 15:47:18,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 15:47:18,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 15:47:18,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 15:47:18,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 15:47:18,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 15:47:18,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 15:47:18,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 15:47:18,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
3: [2022-11-25 15:47:18,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 15:47:18,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
3: [2022-11-25 15:47:18,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 15:47:18,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
5: [2022-11-25 15:47:18,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 15:47:18,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 15:47:18,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 15:47:18,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 15:47:18,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 15:47:18,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 15:47:18,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 15:47:18,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 15:47:18,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 15:47:18,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt
5: [2022-11-25 15:47:18,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
5: [2022-11-25 15:47:18,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt.
5: [2022-11-25 15:47:18,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt
5: [2022-11-25 15:47:18,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
0: successfully saved checkpoint at iteration 5000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6223.07
7: iteration 5010/ 44073 | consumed samples: 2565120 | consumed tokens: 5253365760 | elapsed time per iteration (s): 4.95 | learning rate: 1.952E-04 | global batch size: 512 | lm loss: 2.354123E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.537 | TFLOPs: 48.25 |
7: iteration 5020/ 44073 | consumed samples: 2570240 | consumed tokens: 5263851520 | elapsed time per iteration (s): 4.16 | learning rate: 1.952E-04 | global batch size: 512 | lm loss: 2.367246E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.168 | TFLOPs: 57.40 |
7: iteration 5030/ 44073 | consumed samples: 2575360 | consumed tokens: 5274337280 | elapsed time per iteration (s): 4.15 | learning rate: 1.951E-04 | global batch size: 512 | lm loss: 2.347837E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.384 | TFLOPs: 57.50 |
7: iteration 5040/ 44073 | consumed samples: 2580480 | consumed tokens: 5284823040 | elapsed time per iteration (s): 4.14 | learning rate: 1.951E-04 | global batch size: 512 | lm loss: 2.332362E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 |
7: iteration 5050/ 44073 | consumed samples: 2585600 | consumed tokens: 5295308800 | elapsed time per iteration (s): 4.17 | learning rate: 1.951E-04 | global batch size: 512 | lm loss: 2.341466E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.788 | TFLOPs: 57.23 |
7: iteration 5060/ 44073 | consumed samples: 2590720 | consumed tokens: 5305794560 | elapsed time per iteration (s): 4.16 | learning rate: 1.951E-04 | global batch size: 512 | lm loss: 2.323644E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.096 | TFLOPs: 57.37 |
7: iteration 5070/ 44073 | consumed samples: 2595840 | consumed tokens: 5316280320 | elapsed time per iteration (s): 4.17 | learning rate: 1.950E-04 | global batch size: 512 | lm loss: 2.347613E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.912 | TFLOPs: 57.28 |
7: iteration 5080/ 44073 | consumed samples: 2600960 | consumed tokens: 5326766080 | elapsed time per iteration (s): 4.16 | learning rate: 1.950E-04 | global batch size: 512 | lm loss: 2.331264E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.024 | TFLOPs: 57.34 |
7: iteration 5090/ 44073 | consumed samples: 2606080 | consumed tokens: 5337251840 | elapsed time per iteration (s): 4.18 | learning rate: 1.950E-04 | global batch size: 512 | lm loss: 2.349526E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.584 | TFLOPs: 57.13 |
7: iteration 5100/ 44073 | consumed samples: 2611200 | consumed tokens: 5347737600 | elapsed time per iteration (s): 4.20 | learning rate: 1.950E-04 | global batch size: 512 | lm loss: 2.350253E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.836 | TFLOPs: 56.78 |
7: iteration 5110/ 44073 | consumed samples: 2616320 | consumed tokens: 5358223360 | elapsed time per iteration (s): 4.16 | learning rate: 1.950E-04 | global batch size: 512 | lm loss: 2.335134E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.108 | TFLOPs: 57.37 |
7: iteration 5120/ 44073 | consumed samples: 2621440 | consumed tokens: 5368709120 | elapsed time per iteration (s): 4.16 | learning rate: 1.949E-04 | global batch size: 512 | lm loss: 2.348039E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.986 | TFLOPs: 57.32 |
7: iteration 5130/ 44073 | consumed samples: 2626560 | consumed tokens: 5379194880 | elapsed time per iteration (s): 4.16 | learning rate: 1.949E-04 | global batch size: 512 | lm loss: 2.334500E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.032 | TFLOPs: 57.34 |
7: iteration 5140/ 44073 | consumed samples: 2631680 | consumed tokens: 5389680640 | elapsed time per iteration (s): 4.18 | learning rate: 1.949E-04 | global batch size: 512 | lm loss: 2.347139E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.498 | TFLOPs: 57.09 |
7: iteration 5150/ 44073 | consumed samples: 2636800 | consumed tokens: 5400166400 | elapsed time per iteration (s): 4.17 | learning rate: 1.949E-04 | global batch size: 512 | lm loss: 2.340686E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.645 | TFLOPs: 57.16 |
7: iteration 5160/ 44073 | consumed samples: 2641920 | consumed tokens: 5410652160 | elapsed time per iteration (s): 4.19 | learning rate: 1.949E-04 | global batch size: 512 | lm loss: 2.356678E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.208 | TFLOPs: 56.96 |
7: iteration 5170/ 44073 | consumed samples: 2647040 | consumed tokens: 5421137920 | elapsed time per iteration (s): 4.17 | learning rate: 1.948E-04 | global batch size: 512 | lm loss: 2.346184E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.747 | TFLOPs: 57.21 |
7: iteration 5180/ 44073 | consumed samples: 2652160 | consumed tokens: 5431623680 | elapsed time per iteration (s): 4.18 | learning rate: 1.948E-04 | global batch size: 512 | lm loss: 2.343159E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.574 | TFLOPs: 57.13 |
7: iteration 5190/ 44073 | consumed samples: 2657280 | consumed tokens: 5442109440 | elapsed time per iteration (s): 4.15 | learning rate: 1.948E-04 | global batch size: 512 | lm loss: 2.340376E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.251 | TFLOPs: 57.44 |
7: iteration 5200/ 44073 | consumed samples: 2662400 | consumed tokens: 5452595200 | elapsed time per iteration (s): 4.17 | learning rate: 1.948E-04 | global batch size: 512 | lm loss: 2.333797E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.679 | TFLOPs: 57.17 |
7: iteration 5210/ 44073 | consumed samples: 2667520 | consumed tokens: 5463080960 | elapsed time per iteration (s): 4.18 | learning rate: 1.947E-04 | global batch size: 512 | lm loss: 2.340866E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.528 | TFLOPs: 57.10 |
7: iteration 5220/ 44073 | consumed samples: 2672640 | consumed tokens: 5473566720 | elapsed time per iteration (s): 4.16 | learning rate: 1.947E-04 | global batch size: 512 | lm loss: 2.340535E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.133 | TFLOPs: 57.39 |
7: iteration 5230/ 44073 | consumed samples: 2677760 | consumed tokens: 5484052480 | elapsed time per iteration (s): 4.17 | learning rate: 1.947E-04 | global batch size: 512 | lm loss: 2.326804E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.913 | TFLOPs: 57.28 |
7: iteration 5240/ 44073 | consumed samples: 2682880 | consumed tokens: 5494538240 | elapsed time per iteration (s): 4.17 | learning rate: 1.947E-04 | global batch size: 512 | lm loss: 2.331627E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.810 | TFLOPs: 57.24 |
7: iteration 5250/ 44073 | consumed samples: 2688000 | consumed tokens: 5505024000 | elapsed time per iteration (s): 4.14 | learning rate: 1.947E-04 | global batch size: 512 | lm loss: 2.326657E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.560 | TFLOPs: 57.59 |
7: iteration 5260/ 44073 | consumed samples: 2693120 | consumed tokens: 5515509760 | elapsed time per iteration (s): 4.19 | learning rate: 1.946E-04 | global batch size: 512 | lm loss: 2.359118E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.298 | TFLOPs: 57.00 |
7: iteration 5270/ 44073 | consumed samples: 2698240 | consumed tokens: 5525995520 | elapsed time per iteration (s): 4.15 | learning rate: 1.946E-04 | global batch size: 512 | lm loss: 2.326892E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.462 | TFLOPs: 57.54 |
7: iteration 5280/ 44073 | consumed samples: 2703360 | consumed tokens: 5536481280 | elapsed time per iteration (s): 4.15 | learning rate: 1.946E-04 | global batch size: 512 | lm loss: 2.316382E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.371 | TFLOPs: 57.50 |
7: iteration 5290/ 44073 | consumed samples: 2708480 | consumed tokens: 5546967040 | elapsed time per iteration (s): 4.14 | learning rate: 1.946E-04 | global batch size: 512 | lm loss: 2.319881E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 |
7: iteration 5300/ 44073 | consumed samples: 2713600 | consumed tokens: 5557452800 | elapsed time per iteration (s): 4.16 | learning rate: 1.945E-04 | global batch size: 512 | lm loss: 2.336962E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.207 | TFLOPs: 57.42 |
7: iteration 5310/ 44073 | consumed samples: 2718720 | consumed tokens: 5567938560 | elapsed time per iteration (s): 4.17 | learning rate: 1.945E-04 | global batch size: 512 | lm loss: 2.308213E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.662 | TFLOPs: 57.17 |
7: iteration 5320/ 44073 | consumed samples: 2723840 | consumed tokens: 5578424320 | elapsed time per iteration (s): 4.14 | learning rate: 1.945E-04 | global batch size: 512 | lm loss: 2.314774E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.641 | TFLOPs: 57.62 |
7: iteration 5330/ 
44073 | consumed samples: 2728960 | consumed tokens: 5588910080 | elapsed time per iteration (s): 4.14 | learning rate: 1.945E-04 | global batch size: 512 | lm loss: 2.332048E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.674 | TFLOPs: 57.64 | 7: iteration 5340/ 44073 | consumed samples: 2734080 | consumed tokens: 5599395840 | elapsed time per iteration (s): 4.14 | learning rate: 1.945E-04 | global batch size: 512 | lm loss: 2.334245E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.721 | TFLOPs: 57.66 | 7: iteration 5350/ 44073 | consumed samples: 2739200 | consumed tokens: 5609881600 | elapsed time per iteration (s): 4.14 | learning rate: 1.944E-04 | global batch size: 512 | lm loss: 2.330633E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 5360/ 44073 | consumed samples: 2744320 | consumed tokens: 5620367360 | elapsed time per iteration (s): 4.14 | learning rate: 1.944E-04 | global batch size: 512 | lm loss: 2.347503E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.717 | TFLOPs: 57.66 | 7: iteration 5370/ 44073 | consumed samples: 2749440 | consumed tokens: 5630853120 | elapsed time per iteration (s): 4.17 | learning rate: 1.944E-04 | global batch size: 512 | lm loss: 2.310438E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.822 | TFLOPs: 57.24 | 7: iteration 5380/ 44073 | consumed samples: 2754560 | consumed tokens: 5641338880 | elapsed time per iteration (s): 4.14 | learning rate: 1.944E-04 | global batch size: 512 | lm loss: 2.327362E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.574 | TFLOPs: 57.59 | 7: iteration 5390/ 44073 | consumed samples: 2759680 | consumed tokens: 5651824640 | elapsed time per iteration (s): 4.14 | learning rate: 1.943E-04 | global batch size: 512 | lm loss: 2.290480E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 | 7: iteration 5400/ 44073 | consumed samples: 2764800 | consumed tokens: 5662310400 | elapsed time per iteration (s): 4.14 | learning rate: 1.943E-04 | global batch size: 512 | lm loss: 2.314888E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.617 | TFLOPs: 57.61 | 7: iteration 5410/ 44073 | consumed samples: 2769920 | consumed tokens: 5672796160 | elapsed time per iteration (s): 4.15 | learning rate: 1.943E-04 | global batch size: 512 | lm loss: 2.310661E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.322 | TFLOPs: 57.47 | 7: iteration 5420/ 44073 | consumed samples: 2775040 | consumed tokens: 5683281920 | elapsed time per iteration (s): 4.15 | learning rate: 1.943E-04 | global batch size: 512 | lm loss: 2.316743E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.505 | TFLOPs: 57.56 | 7: iteration 5430/ 44073 | consumed samples: 2780160 | consumed tokens: 5693767680 | elapsed time per iteration (s): 4.16 | learning rate: 1.943E-04 | global batch size: 512 | lm loss: 2.312607E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.125 | TFLOPs: 57.38 | 7: iteration 5440/ 44073 | consumed samples: 2785280 | consumed tokens: 5704253440 | elapsed time per iteration (s): 4.16 | learning rate: 1.942E-04 | global batch size: 512 | lm loss: 2.328681E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.047 | TFLOPs: 57.35 | 7: iteration 5450/ 44073 | consumed samples: 2790400 | consumed tokens: 5714739200 | elapsed time per iteration (s): 4.14 | learning rate: 1.942E-04 | global batch size: 512 | lm loss: 2.313169E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 5460/ 44073 | consumed samples: 2795520 | consumed tokens: 5725224960 | elapsed time per iteration (s): 4.16 | learning rate: 1.942E-04 | global batch size: 512 | lm loss: 2.317591E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.213 | TFLOPs: 57.42 | 7: iteration 5470/ 44073 | consumed samples: 2800640 | consumed tokens: 5735710720 | elapsed time per iteration (s): 4.18 | learning rate: 1.942E-04 | global batch size: 512 | lm loss: 2.289552E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.425 | TFLOPs: 57.06 | 7: iteration 5480/ 44073 | consumed samples: 2805760 | consumed tokens: 5746196480 | elapsed time per iteration (s): 4.18 | learning rate: 1.941E-04 | global batch size: 512 | lm loss: 2.322494E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.620 | TFLOPs: 57.15 | 7: iteration 5490/ 44073 | consumed samples: 2810880 | consumed tokens: 5756682240 | elapsed time per iteration (s): 4.16 | learning rate: 1.941E-04 | global batch size: 512 | lm loss: 2.312510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.061 | TFLOPs: 57.35 | 7: iteration 5500/ 44073 | consumed samples: 2816000 | consumed tokens: 5767168000 | elapsed time per iteration (s): 4.16 | 
learning rate: 1.941E-04 | global batch size: 512 | lm loss: 2.322764E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.125 | TFLOPs: 57.38 | 7: iteration 5510/ 44073 | consumed samples: 2821120 | consumed tokens: 5777653760 | elapsed time per iteration (s): 4.15 | learning rate: 1.941E-04 | global batch size: 512 | lm loss: 2.322576E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.399 | TFLOPs: 57.51 | 7: iteration 5520/ 44073 | consumed samples: 2826240 | consumed tokens: 5788139520 | elapsed time per iteration (s): 4.16 | learning rate: 1.940E-04 | global batch size: 512 | lm loss: 2.300802E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.098 | TFLOPs: 57.37 | 7: iteration 5530/ 44073 | consumed samples: 2831360 | consumed tokens: 5798625280 | elapsed time per iteration (s): 4.16 | learning rate: 1.940E-04 | global batch size: 512 | lm loss: 2.312586E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.046 | TFLOPs: 57.35 | 7: iteration 5540/ 44073 | consumed samples: 2836480 | consumed tokens: 5809111040 | elapsed time per iteration (s): 4.24 | learning rate: 1.940E-04 | global batch size: 512 | lm loss: 2.312331E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.840 | TFLOPs: 56.32 | 7: iteration 5550/ 44073 | consumed samples: 2841600 | consumed tokens: 5819596800 | elapsed time per iteration (s): 4.20 | learning rate: 1.940E-04 | global batch size: 512 | lm loss: 2.316909E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.955 | TFLOPs: 56.84 | 7: iteration 5560/ 44073 | consumed samples: 
2846720 | consumed tokens: 5830082560 | elapsed time per iteration (s): 4.16 | learning rate: 1.940E-04 | global batch size: 512 | lm loss: 2.329133E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.022 | TFLOPs: 57.33 | 7: iteration 5570/ 44073 | consumed samples: 2851840 | consumed tokens: 5840568320 | elapsed time per iteration (s): 4.18 | learning rate: 1.939E-04 | global batch size: 512 | lm loss: 2.299991E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.611 | TFLOPs: 57.14 | 7: iteration 5580/ 44073 | consumed samples: 2856960 | consumed tokens: 5851054080 | elapsed time per iteration (s): 4.16 | learning rate: 1.939E-04 | global batch size: 512 | lm loss: 2.310463E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.120 | TFLOPs: 57.38 | 7: iteration 5590/ 44073 | consumed samples: 2862080 | consumed tokens: 5861539840 | elapsed time per iteration (s): 4.22 | learning rate: 1.939E-04 | global batch size: 512 | lm loss: 2.310862E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.277 | TFLOPs: 56.52 | 7: iteration 5600/ 44073 | consumed samples: 2867200 | consumed tokens: 5872025600 | elapsed time per iteration (s): 4.17 | learning rate: 1.939E-04 | global batch size: 512 | lm loss: 2.306912E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.720 | TFLOPs: 57.19 | 7: iteration 5610/ 44073 | consumed samples: 2872320 | consumed tokens: 5882511360 | elapsed time per iteration (s): 4.17 | learning rate: 1.938E-04 | global batch size: 512 | lm loss: 2.326416E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.926 | TFLOPs: 57.29 | 7: iteration 5620/ 44073 | consumed samples: 2877440 | consumed tokens: 5892997120 | elapsed time per iteration (s): 4.19 | learning rate: 1.938E-04 | global batch size: 512 | lm loss: 2.325448E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.185 | TFLOPs: 56.94 | 7: iteration 5630/ 44073 | consumed samples: 2882560 | consumed tokens: 5903482880 | elapsed time per iteration (s): 4.17 | learning rate: 1.938E-04 | global batch size: 512 | lm loss: 2.313859E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.735 | TFLOPs: 57.20 | 7: iteration 5640/ 44073 | consumed samples: 2887680 | consumed tokens: 5913968640 | elapsed time per iteration (s): 4.16 | learning rate: 1.938E-04 | global batch size: 512 | lm loss: 2.325600E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.968 | TFLOPs: 57.31 | 7: iteration 5650/ 44073 | consumed samples: 2892800 | consumed tokens: 5924454400 | elapsed time per iteration (s): 7.74 | learning rate: 1.937E-04 | global batch size: 512 | lm loss: 2.294400E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 66.140 | TFLOPs: 30.82 | 7: iteration 5660/ 44073 | consumed samples: 2897920 | consumed tokens: 5934940160 | elapsed time per iteration (s): 6.92 | learning rate: 1.937E-04 | global batch size: 512 | lm loss: 2.297473E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 73.967 | TFLOPs: 34.47 | 7: iteration 5670/ 44073 | consumed samples: 2903040 | consumed tokens: 5945425920 | elapsed time per iteration (s): 4.48 | learning rate: 1.937E-04 | global batch size: 512 | lm loss: 2.302546E+00 | grad norm: 0.160 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.380 | TFLOPs: 53.31 | 7: iteration 5680/ 44073 | consumed samples: 2908160 | consumed tokens: 5955911680 | elapsed time per iteration (s): 4.19 | learning rate: 1.937E-04 | global batch size: 512 | lm loss: 2.322703E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.135 | TFLOPs: 56.92 | 7: iteration 5690/ 44073 | consumed samples: 2913280 | consumed tokens: 5966397440 | elapsed time per iteration (s): 4.17 | learning rate: 1.936E-04 | global batch size: 512 | lm loss: 2.306979E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.681 | TFLOPs: 57.18 | 7: iteration 5700/ 44073 | consumed samples: 2918400 | consumed tokens: 5976883200 | elapsed time per iteration (s): 4.19 | learning rate: 1.936E-04 | global batch size: 512 | lm loss: 2.316458E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.160 | TFLOPs: 56.93 | 7: iteration 5710/ 44073 | consumed samples: 2923520 | consumed tokens: 5987368960 | elapsed time per iteration (s): 4.14 | learning rate: 1.936E-04 | global batch size: 512 | lm loss: 2.276897E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.528 | TFLOPs: 57.57 | 7: iteration 5720/ 44073 | consumed samples: 2928640 | consumed tokens: 5997854720 | elapsed time per iteration (s): 4.15 | learning rate: 1.936E-04 | global batch size: 512 | lm loss: 2.309510E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.503 | TFLOPs: 57.56 | 7: iteration 5730/ 44073 | consumed samples: 2933760 | consumed tokens: 6008340480 | elapsed time per iteration (s): 4.16 | learning rate: 1.936E-04 | global 
batch size: 512 | lm loss: 2.293452E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.127 | TFLOPs: 57.38 | 7: iteration 5740/ 44073 | consumed samples: 2938880 | consumed tokens: 6018826240 | elapsed time per iteration (s): 4.16 | learning rate: 1.935E-04 | global batch size: 512 | lm loss: 2.323180E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.079 | TFLOPs: 57.36 | 7: iteration 5750/ 44073 | consumed samples: 2944000 | consumed tokens: 6029312000 | elapsed time per iteration (s): 4.17 | learning rate: 1.935E-04 | global batch size: 512 | lm loss: 2.319274E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.804 | TFLOPs: 57.23 | 7: iteration 5760/ 44073 | consumed samples: 2949120 | consumed tokens: 6039797760 | elapsed time per iteration (s): 4.14 | learning rate: 1.935E-04 | global batch size: 512 | lm loss: 2.313880E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 57.61 | 7: iteration 5770/ 44073 | consumed samples: 2954240 | consumed tokens: 6050283520 | elapsed time per iteration (s): 4.16 | learning rate: 1.935E-04 | global batch size: 512 | lm loss: 2.296924E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.060 | TFLOPs: 57.35 | 7: iteration 5780/ 44073 | consumed samples: 2959360 | consumed tokens: 6060769280 | elapsed time per iteration (s): 4.15 | learning rate: 1.934E-04 | global batch size: 512 | lm loss: 2.315738E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.477 | TFLOPs: 57.55 | 7: iteration 5790/ 44073 | consumed samples: 2964480 | consumed tokens: 
6071255040 | elapsed time per iteration (s): 4.15 | learning rate: 1.934E-04 | global batch size: 512 | lm loss: 2.285949E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 5800/ 44073 | consumed samples: 2969600 | consumed tokens: 6081740800 | elapsed time per iteration (s): 4.16 | learning rate: 1.934E-04 | global batch size: 512 | lm loss: 2.311537E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.143 | TFLOPs: 57.39 | 7: iteration 5810/ 44073 | consumed samples: 2974720 | consumed tokens: 6092226560 | elapsed time per iteration (s): 4.23 | learning rate: 1.934E-04 | global batch size: 512 | lm loss: 2.325009E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.054 | TFLOPs: 56.42 | 7: iteration 5820/ 44073 | consumed samples: 2979840 | consumed tokens: 6102712320 | elapsed time per iteration (s): 8.19 | learning rate: 1.933E-04 | global batch size: 512 | lm loss: 2.307777E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 62.515 | TFLOPs: 29.14 | 7: iteration 5830/ 44073 | consumed samples: 2984960 | consumed tokens: 6113198080 | elapsed time per iteration (s): 7.51 | learning rate: 1.933E-04 | global batch size: 512 | lm loss: 2.292696E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 68.205 | TFLOPs: 31.79 | 7: iteration 5840/ 44073 | consumed samples: 2990080 | consumed tokens: 6123683840 | elapsed time per iteration (s): 4.33 | learning rate: 1.933E-04 | global batch size: 512 | lm loss: 2.302678E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.229 | TFLOPs: 55.10 
| 7: iteration 5850/ 44073 | consumed samples: 2995200 | consumed tokens: 6134169600 | elapsed time per iteration (s): 4.16 | learning rate: 1.933E-04 | global batch size: 512 | lm loss: 2.297955E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.084 | TFLOPs: 57.36 | 7: iteration 5860/ 44073 | consumed samples: 3000320 | consumed tokens: 6144655360 | elapsed time per iteration (s): 4.15 | learning rate: 1.932E-04 | global batch size: 512 | lm loss: 2.298485E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.336 | TFLOPs: 57.48 | 7: iteration 5870/ 44073 | consumed samples: 3005440 | consumed tokens: 6155141120 | elapsed time per iteration (s): 4.16 | learning rate: 1.932E-04 | global batch size: 512 | lm loss: 2.297169E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.045 | TFLOPs: 57.35 | 7: iteration 5880/ 44073 | consumed samples: 3010560 | consumed tokens: 6165626880 | elapsed time per iteration (s): 4.15 | learning rate: 1.932E-04 | global batch size: 512 | lm loss: 2.294400E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 | 7: iteration 5890/ 44073 | consumed samples: 3015680 | consumed tokens: 6176112640 | elapsed time per iteration (s): 4.16 | learning rate: 1.932E-04 | global batch size: 512 | lm loss: 2.400800E+00 | grad norm: 25.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.008 | TFLOPs: 57.33 | 7: iteration 5900/ 44073 | consumed samples: 3020800 | consumed tokens: 6186598400 | elapsed time per iteration (s): 4.15 | learning rate: 1.931E-04 | global batch size: 512 | lm loss: 4.357144E+00 | grad norm: 11.337 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 123.431 | TFLOPs: 57.52 | 7: iteration 5910/ 44073 | consumed samples: 3025920 | consumed tokens: 6197084160 | elapsed time per iteration (s): 4.20 | learning rate: 1.931E-04 | global batch size: 512 | lm loss: 4.591341E+00 | grad norm: 5.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.767 | TFLOPs: 56.75 | 7: iteration 5920/ 44073 | consumed samples: 3031040 | consumed tokens: 6207569920 | elapsed time per iteration (s): 4.22 | learning rate: 1.931E-04 | global batch size: 512 | lm loss: 3.529504E+00 | grad norm: 1.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.387 | TFLOPs: 56.57 | 7: iteration 5930/ 44073 | consumed samples: 3036160 | consumed tokens: 6218055680 | elapsed time per iteration (s): 4.17 | learning rate: 1.931E-04 | global batch size: 512 | lm loss: 3.082863E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.721 | TFLOPs: 57.19 | 7: iteration 5940/ 44073 | consumed samples: 3041280 | consumed tokens: 6228541440 | elapsed time per iteration (s): 4.15 | learning rate: 1.930E-04 | global batch size: 512 | lm loss: 2.857045E+00 | grad norm: 0.517 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.263 | TFLOPs: 57.45 | 7: iteration 5950/ 44073 | consumed samples: 3046400 | consumed tokens: 6239027200 | elapsed time per iteration (s): 4.15 | learning rate: 1.930E-04 | global batch size: 512 | lm loss: 2.707027E+00 | grad norm: 0.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.266 | TFLOPs: 57.45 | 7: iteration 5960/ 44073 | consumed samples: 3051520 | consumed tokens: 6249512960 | elapsed time per iteration (s): 4.20 | learning rate: 1.930E-04 | global batch size: 512 | lm loss: 
2.600266E+00 | grad norm: 0.338 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.803 | TFLOPs: 56.77 | 7: iteration 5970/ 44073 | consumed samples: 3056640 | consumed tokens: 6259998720 | elapsed time per iteration (s): 4.15 | learning rate: 1.930E-04 | global batch size: 512 | lm loss: 2.521312E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.414 | TFLOPs: 57.52 | 7: iteration 5980/ 44073 | consumed samples: 3061760 | consumed tokens: 6270484480 | elapsed time per iteration (s): 4.17 | learning rate: 1.929E-04 | global batch size: 512 | lm loss: 2.422701E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.896 | TFLOPs: 57.28 | 7: iteration 5990/ 44073 | consumed samples: 3066880 | consumed tokens: 6280970240 | elapsed time per iteration (s): 4.16 | learning rate: 1.929E-04 | global batch size: 512 | lm loss: 2.420606E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.160 | TFLOPs: 57.40 | 0: [2022-11-25 16:59:05,234] [INFO] [logging.py:68:log_dist] [Rank 0] step=6000, skipped=0, lr=[0.00019288601721738536, 0.00019288601721738536, 0.00019288601721738536], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 6000/ 44073 | consumed samples: 3072000 | consumed tokens: 6291456000 | elapsed time per iteration (s): 4.16 | learning rate: 1.929E-04 | global batch size: 512 | lm loss: 2.398223E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.135 | TFLOPs: 57.39 | 0: steps: 6000 loss: 2.4362 iter time (s): 4.237 samples/sec: 120.841 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 6000 | lm loss value: 2.331460E+00 | lm loss 
PPL: 1.029296E+01 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 6000 to checkpoints_2b2 0: [2022-11-25 16:59:06,616] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step6000 is begin to save! 0: [2022-11-25 16:59:06,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_01-model_00-model_states.pt... 0: [2022-11-25 16:59:07,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_01-model_00-model_states.pt. 0: [2022-11-25 16:59:07,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_03-model_00-model_states.pt... 0: [2022-11-25 16:59:07,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_03-model_00-model_states.pt. 0: [2022-11-25 16:59:07,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_04-model_00-model_states.pt... 0: [2022-11-25 16:59:07,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_04-model_00-model_states.pt. 0: [2022-11-25 16:59:07,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_05-model_00-model_states.pt... 0: [2022-11-25 16:59:07,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_05-model_00-model_states.pt. 0: [2022-11-25 16:59:07,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_06-model_00-model_states.pt... 0: [2022-11-25 16:59:07,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_06-model_00-model_states.pt. 
0: [2022-11-25 16:59:07,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_07-model_00-model_states.pt...
0: [2022-11-25 16:59:07,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_07-model_00-model_states.pt.
0: [2022-11-25 16:59:07,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_08-model_00-model_states.pt...
0: [2022-11-25 16:59:07,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_08-model_00-model_states.pt.
0: [2022-11-25 16:59:07,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_09-model_00-model_states.pt...
0: [2022-11-25 16:59:08,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_09-model_00-model_states.pt.
0: [2022-11-25 16:59:08,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_10-model_00-model_states.pt...
0: [2022-11-25 16:59:08,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_10-model_00-model_states.pt.
0: [2022-11-25 16:59:08,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_11-model_00-model_states.pt...
0: [2022-11-25 16:59:08,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_11-model_00-model_states.pt.
0: [2022-11-25 16:59:08,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_12-model_00-model_states.pt...
0: [2022-11-25 16:59:08,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_12-model_00-model_states.pt.
0: [2022-11-25 16:59:08,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_13-model_00-model_states.pt...
0: [2022-11-25 16:59:08,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_13-model_00-model_states.pt.
0: [2022-11-25 16:59:08,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_14-model_00-model_states.pt...
0: [2022-11-25 16:59:08,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_14-model_00-model_states.pt.
0: [2022-11-25 16:59:08,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_15-model_00-model_states.pt...
0: [2022-11-25 16:59:08,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_15-model_00-model_states.pt.
0: [2022-11-25 16:59:08,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_16-model_00-model_states.pt...
0: [2022-11-25 16:59:09,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_16-model_00-model_states.pt.
0: [2022-11-25 16:59:09,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_17-model_00-model_states.pt...
0: [2022-11-25 16:59:09,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_17-model_00-model_states.pt.
0: [2022-11-25 16:59:09,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_18-model_00-model_states.pt...
0: [2022-11-25 16:59:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_18-model_00-model_states.pt.
0: [2022-11-25 16:59:09,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_19-model_00-model_states.pt...
0: [2022-11-25 16:59:09,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_19-model_00-model_states.pt.
0: [2022-11-25 16:59:09,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_20-model_00-model_states.pt...
0: [2022-11-25 16:59:09,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_20-model_00-model_states.pt.
0: [2022-11-25 16:59:09,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_21-model_00-model_states.pt...
0: [2022-11-25 16:59:09,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_21-model_00-model_states.pt.
0: [2022-11-25 16:59:09,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_22-model_00-model_states.pt...
0: [2022-11-25 16:59:09,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_22-model_00-model_states.pt.
0: [2022-11-25 16:59:09,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_23-model_00-model_states.pt...
0: [2022-11-25 16:59:09,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_23-model_00-model_states.pt.
0: [2022-11-25 16:59:09,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_24-model_00-model_states.pt...
0: [2022-11-25 16:59:10,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_24-model_00-model_states.pt.
0: [2022-11-25 16:59:10,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_25-model_00-model_states.pt...
0: [2022-11-25 16:59:10,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_25-model_00-model_states.pt.
0: [2022-11-25 16:59:10,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_26-model_00-model_states.pt...
0: [2022-11-25 16:59:10,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_26-model_00-model_states.pt.
0: [2022-11-25 16:59:10,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_27-model_00-model_states.pt...
0: [2022-11-25 16:59:10,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_27-model_00-model_states.pt.
0: [2022-11-25 16:59:10,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_28-model_00-model_states.pt...
0: [2022-11-25 16:59:10,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_28-model_00-model_states.pt.
0: [2022-11-25 16:59:10,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_29-model_00-model_states.pt...
0: [2022-11-25 16:59:10,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_29-model_00-model_states.pt.
0: [2022-11-25 16:59:10,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_30-model_00-model_states.pt...
0: [2022-11-25 16:59:10,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_30-model_00-model_states.pt.
0: [2022-11-25 16:59:10,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_31-model_00-model_states.pt... 0: [2022-11-25 16:59:11,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_31-model_00-model_states.pt. 0: [2022-11-25 16:59:11,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_32-model_00-model_states.pt... 0: [2022-11-25 16:59:11,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_32-model_00-model_states.pt. 0: [2022-11-25 16:59:11,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_33-model_00-model_states.pt... 0: [2022-11-25 16:59:11,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_33-model_00-model_states.pt. 0: [2022-11-25 16:59:11,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_34-model_00-model_states.pt... 0: [2022-11-25 16:59:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_34-model_00-model_states.pt. 0: [2022-11-25 16:59:11,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/layer_36-model_00-model_states.pt... 0: [2022-11-25 16:59:11,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/layer_36-model_00-model_states.pt. 0: [2022-11-25 16:59:11,497] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step6000/mp_rank_00_model_states.pt 0: [2022-11-25 16:59:11,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/mp_rank_00_model_states.pt... 
0: [2022-11-25 16:59:11,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/mp_rank_00_model_states.pt. 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 16:59:11,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-25 16:59:12,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 16:59:12,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
0: [2022-11-25 16:59:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
0: [2022-11-25 16:59:12,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
4: [2022-11-25 16:59:12,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 16:59:12,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 16:59:12,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 16:59:12,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 16:59:12,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 16:59:12,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 16:59:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-25 16:59:12,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 16:59:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 16:59:12,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 16:59:12,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 16:59:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 16:59:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 16:59:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 16:59:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 16:59:12,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 16:59:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 16:59:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 16:59:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
5: [2022-11-25 16:59:12,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 16:59:12,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 16:59:12,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 16:59:12,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 16:59:12,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: successfully saved checkpoint at iteration 6000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6240.38 7: iteration 6010/ 44073 | consumed samples: 3077120 | consumed tokens: 6301941760 | elapsed time per iteration (s): 4.92 | learning rate: 1.929E-04 | global batch size: 512 | lm loss: 2.352177E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.134 | TFLOPs: 48.53 | 7: iteration 6020/ 44073 | consumed samples: 3082240 | consumed tokens: 6312427520 | elapsed time per iteration (s): 4.17 | learning rate: 1.928E-04 | global batch size: 512 | lm loss: 2.346381E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.886 | TFLOPs: 57.27 | 7: iteration 6030/ 44073 | consumed samples: 3087360 | consumed tokens: 6322913280 | elapsed time per iteration (s): 4.17 | learning rate: 1.928E-04 | global batch size: 512 | lm loss: 2.332820E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.664 | TFLOPs: 57.17 | 7: iteration 6040/ 44073 | consumed samples: 
3092480 | consumed tokens: 6333399040 | elapsed time per iteration (s): 4.17 | learning rate: 1.928E-04 | global batch size: 512 | lm loss: 2.347095E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.914 | TFLOPs: 57.28 | 7: iteration 6050/ 44073 | consumed samples: 3097600 | consumed tokens: 6343884800 | elapsed time per iteration (s): 4.14 | learning rate: 1.928E-04 | global batch size: 512 | lm loss: 2.327042E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.546 | TFLOPs: 57.58 | 7: iteration 6060/ 44073 | consumed samples: 3102720 | consumed tokens: 6354370560 | elapsed time per iteration (s): 4.18 | learning rate: 1.927E-04 | global batch size: 512 | lm loss: 2.323899E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.608 | TFLOPs: 57.14 | 7: iteration 6070/ 44073 | consumed samples: 3107840 | consumed tokens: 6364856320 | elapsed time per iteration (s): 4.24 | learning rate: 1.927E-04 | global batch size: 512 | lm loss: 2.350118E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.870 | TFLOPs: 56.33 | 7: iteration 6080/ 44073 | consumed samples: 3112960 | consumed tokens: 6375342080 | elapsed time per iteration (s): 4.16 | learning rate: 1.927E-04 | global batch size: 512 | lm loss: 2.335043E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.042 | TFLOPs: 57.34 | 7: iteration 6090/ 44073 | consumed samples: 3118080 | consumed tokens: 6385827840 | elapsed time per iteration (s): 4.16 | learning rate: 1.927E-04 | global batch size: 512 | lm loss: 2.294753E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.052 | TFLOPs: 57.35 | 7: iteration 6100/ 44073 | consumed samples: 3123200 | consumed tokens: 6396313600 | elapsed time per iteration (s): 4.14 | learning rate: 1.926E-04 | global batch size: 512 | lm loss: 2.326874E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.760 | TFLOPs: 57.68 | 7: iteration 6110/ 44073 | consumed samples: 3128320 | consumed tokens: 6406799360 | elapsed time per iteration (s): 4.15 | learning rate: 1.926E-04 | global batch size: 512 | lm loss: 2.332491E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.476 | TFLOPs: 57.55 | 7: iteration 6120/ 44073 | consumed samples: 3133440 | consumed tokens: 6417285120 | elapsed time per iteration (s): 4.14 | learning rate: 1.926E-04 | global batch size: 512 | lm loss: 2.310118E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.751 | TFLOPs: 57.67 | 7: iteration 6130/ 44073 | consumed samples: 3138560 | consumed tokens: 6427770880 | elapsed time per iteration (s): 4.17 | learning rate: 1.926E-04 | global batch size: 512 | lm loss: 2.323441E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.898 | TFLOPs: 57.28 | 7: iteration 6140/ 44073 | consumed samples: 3143680 | consumed tokens: 6438256640 | elapsed time per iteration (s): 4.14 | learning rate: 1.925E-04 | global batch size: 512 | lm loss: 2.321258E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.556 | TFLOPs: 57.58 | 7: iteration 6150/ 44073 | consumed samples: 3148800 | consumed tokens: 6448742400 | elapsed time per iteration (s): 4.16 | learning rate: 1.925E-04 | global batch size: 512 | lm loss: 2.321519E+00 | grad norm: 0.169 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.938 | TFLOPs: 57.30 | 7: iteration 6160/ 44073 | consumed samples: 3153920 | consumed tokens: 6459228160 | elapsed time per iteration (s): 4.14 | learning rate: 1.925E-04 | global batch size: 512 | lm loss: 2.317464E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.576 | TFLOPs: 57.59 | 7: iteration 6170/ 44073 | consumed samples: 3159040 | consumed tokens: 6469713920 | elapsed time per iteration (s): 4.14 | learning rate: 1.925E-04 | global batch size: 512 | lm loss: 2.330205E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.534 | TFLOPs: 57.57 | 7: iteration 6180/ 44073 | consumed samples: 3164160 | consumed tokens: 6480199680 | elapsed time per iteration (s): 4.16 | learning rate: 1.924E-04 | global batch size: 512 | lm loss: 2.316457E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.214 | TFLOPs: 57.42 | 7: iteration 6190/ 44073 | consumed samples: 3169280 | consumed tokens: 6490685440 | elapsed time per iteration (s): 4.16 | learning rate: 1.924E-04 | global batch size: 512 | lm loss: 2.312143E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.958 | TFLOPs: 57.30 | 7: iteration 6200/ 44073 | consumed samples: 3174400 | consumed tokens: 6501171200 | elapsed time per iteration (s): 4.16 | learning rate: 1.924E-04 | global batch size: 512 | lm loss: 2.309326E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.139 | TFLOPs: 57.39 | 7: iteration 6210/ 44073 | consumed samples: 3179520 | consumed tokens: 6511656960 | elapsed time per iteration (s): 4.15 | learning rate: 1.923E-04 | global 
batch size: 512 | lm loss: 2.322460E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 6220/ 44073 | consumed samples: 3184640 | consumed tokens: 6522142720 | elapsed time per iteration (s): 4.17 | learning rate: 1.923E-04 | global batch size: 512 | lm loss: 2.302110E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.821 | TFLOPs: 57.24 | 7: iteration 6230/ 44073 | consumed samples: 3189760 | consumed tokens: 6532628480 | elapsed time per iteration (s): 4.14 | learning rate: 1.923E-04 | global batch size: 512 | lm loss: 2.321315E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.628 | TFLOPs: 57.62 | 7: iteration 6240/ 44073 | consumed samples: 3194880 | consumed tokens: 6543114240 | elapsed time per iteration (s): 4.16 | learning rate: 1.923E-04 | global batch size: 512 | lm loss: 2.293646E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.081 | TFLOPs: 57.36 | 7: iteration 6250/ 44073 | consumed samples: 3200000 | consumed tokens: 6553600000 | elapsed time per iteration (s): 4.15 | learning rate: 1.922E-04 | global batch size: 512 | lm loss: 2.281526E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.349 | TFLOPs: 57.49 | 7: iteration 6260/ 44073 | consumed samples: 3205120 | consumed tokens: 6564085760 | elapsed time per iteration (s): 4.14 | learning rate: 1.922E-04 | global batch size: 512 | lm loss: 2.298387E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.707 | TFLOPs: 57.65 | 7: iteration 6270/ 44073 | consumed samples: 3210240 | consumed tokens: 
6574571520 | elapsed time per iteration (s): 4.15 | learning rate: 1.922E-04 | global batch size: 512 | lm loss: 2.279857E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.506 | TFLOPs: 57.56 | 7: iteration 6280/ 44073 | consumed samples: 3215360 | consumed tokens: 6585057280 | elapsed time per iteration (s): 4.15 | learning rate: 1.922E-04 | global batch size: 512 | lm loss: 2.320474E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.310 | TFLOPs: 57.47 | 7: iteration 6290/ 44073 | consumed samples: 3220480 | consumed tokens: 6595543040 | elapsed time per iteration (s): 4.14 | learning rate: 1.921E-04 | global batch size: 512 | lm loss: 2.301181E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.523 | TFLOPs: 57.57 | 7: iteration 6300/ 44073 | consumed samples: 3225600 | consumed tokens: 6606028800 | elapsed time per iteration (s): 4.17 | learning rate: 1.921E-04 | global batch size: 512 | lm loss: 2.308497E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.912 | TFLOPs: 57.28 | 7: iteration 6310/ 44073 | consumed samples: 3230720 | consumed tokens: 6616514560 | elapsed time per iteration (s): 4.15 | learning rate: 1.921E-04 | global batch size: 512 | lm loss: 2.279193E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.375 | TFLOPs: 57.50 | 7: iteration 6320/ 44073 | consumed samples: 3235840 | consumed tokens: 6627000320 | elapsed time per iteration (s): 4.14 | learning rate: 1.921E-04 | global batch size: 512 | lm loss: 2.312661E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 
57.61 | 7: iteration 6330/ 44073 | consumed samples: 3240960 | consumed tokens: 6637486080 | elapsed time per iteration (s): 4.15 | learning rate: 1.920E-04 | global batch size: 512 | lm loss: 2.284381E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.300 | TFLOPs: 57.46 | 7: iteration 6340/ 44073 | consumed samples: 3246080 | consumed tokens: 6647971840 | elapsed time per iteration (s): 4.14 | learning rate: 1.920E-04 | global batch size: 512 | lm loss: 2.304191E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.552 | TFLOPs: 57.58 | 7: iteration 6350/ 44073 | consumed samples: 3251200 | consumed tokens: 6658457600 | elapsed time per iteration (s): 4.16 | learning rate: 1.920E-04 | global batch size: 512 | lm loss: 2.288386E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.087 | TFLOPs: 57.36 | 7: iteration 6360/ 44073 | consumed samples: 3256320 | consumed tokens: 6668943360 | elapsed time per iteration (s): 4.15 | learning rate: 1.919E-04 | global batch size: 512 | lm loss: 2.279384E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.431 | TFLOPs: 57.52 | 7: iteration 6370/ 44073 | consumed samples: 3261440 | consumed tokens: 6679429120 | elapsed time per iteration (s): 4.17 | learning rate: 1.919E-04 | global batch size: 512 | lm loss: 2.275126E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.751 | TFLOPs: 57.21 | 7: iteration 6380/ 44073 | consumed samples: 3266560 | consumed tokens: 6689914880 | elapsed time per iteration (s): 4.15 | learning rate: 1.919E-04 | global batch size: 512 | lm loss: 2.303964E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.257 | TFLOPs: 57.44 | 7: iteration 6390/ 44073 | consumed samples: 3271680 | consumed tokens: 6700400640 | elapsed time per iteration (s): 4.14 | learning rate: 1.919E-04 | global batch size: 512 | lm loss: 2.285871E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.627 | TFLOPs: 57.62 | 7: iteration 6400/ 44073 | consumed samples: 3276800 | consumed tokens: 6710886400 | elapsed time per iteration (s): 4.16 | learning rate: 1.918E-04 | global batch size: 512 | lm loss: 2.289210E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.031 | TFLOPs: 57.34 | 7: iteration 6410/ 44073 | consumed samples: 3281920 | consumed tokens: 6721372160 | elapsed time per iteration (s): 4.16 | learning rate: 1.918E-04 | global batch size: 512 | lm loss: 2.287165E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.013 | TFLOPs: 57.33 | 7: iteration 6420/ 44073 | consumed samples: 3287040 | consumed tokens: 6731857920 | elapsed time per iteration (s): 4.15 | learning rate: 1.918E-04 | global batch size: 512 | lm loss: 2.285889E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.370 | TFLOPs: 57.50 | 7: iteration 6430/ 44073 | consumed samples: 3292160 | consumed tokens: 6742343680 | elapsed time per iteration (s): 4.14 | learning rate: 1.918E-04 | global batch size: 512 | lm loss: 2.261962E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.774 | TFLOPs: 57.68 | 7: iteration 6440/ 44073 | consumed samples: 3297280 | consumed tokens: 6752829440 | elapsed time per iteration (s): 4.34 | learning rate: 1.917E-04 | global batch size: 512 | 
lm loss: 2.284690E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.919 | TFLOPs: 54.96 | 7: iteration 6450/ 44073 | consumed samples: 3302400 | consumed tokens: 6763315200 | elapsed time per iteration (s): 4.17 | learning rate: 1.917E-04 | global batch size: 512 | lm loss: 2.290300E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.826 | TFLOPs: 57.24 | 7: iteration 6460/ 44073 | consumed samples: 3307520 | consumed tokens: 6773800960 | elapsed time per iteration (s): 4.17 | learning rate: 1.917E-04 | global batch size: 512 | lm loss: 2.307555E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.830 | TFLOPs: 57.25 | 7: iteration 6470/ 44073 | consumed samples: 3312640 | consumed tokens: 6784286720 | elapsed time per iteration (s): 4.15 | learning rate: 1.917E-04 | global batch size: 512 | lm loss: 2.300074E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.388 | TFLOPs: 57.50 | 7: iteration 6480/ 44073 | consumed samples: 3317760 | consumed tokens: 6794772480 | elapsed time per iteration (s): 4.18 | learning rate: 1.916E-04 | global batch size: 512 | lm loss: 2.270021E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.443 | TFLOPs: 57.06 | 7: iteration 6490/ 44073 | consumed samples: 3322880 | consumed tokens: 6805258240 | elapsed time per iteration (s): 4.20 | learning rate: 1.916E-04 | global batch size: 512 | lm loss: 2.286858E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.007 | TFLOPs: 56.86 | 7: iteration 6500/ 44073 | consumed samples: 3328000 | consumed tokens: 6815744000 | elapsed time 
per iteration (s): 4.17 | learning rate: 1.916E-04 | global batch size: 512 | lm loss: 2.274761E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.905 | TFLOPs: 57.28 | 7: iteration 6510/ 44073 | consumed samples: 3333120 | consumed tokens: 6826229760 | elapsed time per iteration (s): 4.16 | learning rate: 1.915E-04 | global batch size: 512 | lm loss: 2.289023E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.192 | TFLOPs: 57.41 | 7: iteration 6520/ 44073 | consumed samples: 3338240 | consumed tokens: 6836715520 | elapsed time per iteration (s): 4.16 | learning rate: 1.915E-04 | global batch size: 512 | lm loss: 2.275866E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.193 | TFLOPs: 57.41 | 7: iteration 6530/ 44073 | consumed samples: 3343360 | consumed tokens: 6847201280 | elapsed time per iteration (s): 4.17 | learning rate: 1.915E-04 | global batch size: 512 | lm loss: 2.260738E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.690 | TFLOPs: 57.18 | 7: iteration 6540/ 44073 | consumed samples: 3348480 | consumed tokens: 6857687040 | elapsed time per iteration (s): 4.17 | learning rate: 1.915E-04 | global batch size: 512 | lm loss: 2.298178E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.637 | TFLOPs: 57.15 | 7: iteration 6550/ 44073 | consumed samples: 3353600 | consumed tokens: 6868172800 | elapsed time per iteration (s): 4.25 | learning rate: 1.914E-04 | global batch size: 512 | lm loss: 2.291101E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.483 | TFLOPs: 56.15 | 7: iteration 6560/ 
44073 | consumed samples: 3358720 | consumed tokens: 6878658560 | elapsed time per iteration (s): 4.15 | learning rate: 1.914E-04 | global batch size: 512 | lm loss: 2.280014E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.288 | TFLOPs: 57.46 | 7: iteration 6570/ 44073 | consumed samples: 3363840 | consumed tokens: 6889144320 | elapsed time per iteration (s): 4.17 | learning rate: 1.914E-04 | global batch size: 512 | lm loss: 2.249982E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.804 | TFLOPs: 57.23 | 7: iteration 6580/ 44073 | consumed samples: 3368960 | consumed tokens: 6899630080 | elapsed time per iteration (s): 4.39 | learning rate: 1.913E-04 | global batch size: 512 | lm loss: 2.287820E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.602 | TFLOPs: 54.34 | 7: iteration 6590/ 44073 | consumed samples: 3374080 | consumed tokens: 6910115840 | elapsed time per iteration (s): 4.17 | learning rate: 1.913E-04 | global batch size: 512 | lm loss: 2.263323E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.854 | TFLOPs: 57.26 | 7: iteration 6600/ 44073 | consumed samples: 3379200 | consumed tokens: 6920601600 | elapsed time per iteration (s): 4.17 | learning rate: 1.913E-04 | global batch size: 512 | lm loss: 2.281454E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.672 | TFLOPs: 57.17 | 7: iteration 6610/ 44073 | consumed samples: 3384320 | consumed tokens: 6931087360 | elapsed time per iteration (s): 4.17 | learning rate: 1.913E-04 | global batch size: 512 | lm loss: 2.271494E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.917 | TFLOPs: 57.29 | 7: iteration 6620/ 44073 | consumed samples: 3389440 | consumed tokens: 6941573120 | elapsed time per iteration (s): 4.18 | learning rate: 1.912E-04 | global batch size: 512 | lm loss: 2.281390E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.372 | TFLOPs: 57.03 | 7: iteration 6630/ 44073 | consumed samples: 3394560 | consumed tokens: 6952058880 | elapsed time per iteration (s): 4.18 | learning rate: 1.912E-04 | global batch size: 512 | lm loss: 2.288364E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.595 | TFLOPs: 57.14 | 7: iteration 6640/ 44073 | consumed samples: 3399680 | consumed tokens: 6962544640 | elapsed time per iteration (s): 4.17 | learning rate: 1.912E-04 | global batch size: 512 | lm loss: 2.271244E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.724 | TFLOPs: 57.20 | 7: iteration 6650/ 44073 | consumed samples: 3404800 | consumed tokens: 6973030400 | elapsed time per iteration (s): 4.16 | learning rate: 1.912E-04 | global batch size: 512 | lm loss: 2.289170E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.186 | TFLOPs: 57.41 | 7: iteration 6660/ 44073 | consumed samples: 3409920 | consumed tokens: 6983516160 | elapsed time per iteration (s): 4.17 | learning rate: 1.911E-04 | global batch size: 512 | lm loss: 2.277404E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.678 | TFLOPs: 57.17 | 7: iteration 6670/ 44073 | consumed samples: 3415040 | consumed tokens: 6994001920 | elapsed time per iteration (s): 4.18 | learning rate: 1.911E-04 | global batch size: 512 | lm loss: 2.269222E+00 | grad 
norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.408 | TFLOPs: 57.05 | 7: iteration 6680/ 44073 | consumed samples: 3420160 | consumed tokens: 7004487680 | elapsed time per iteration (s): 4.18 | learning rate: 1.911E-04 | global batch size: 512 | lm loss: 2.271198E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.516 | TFLOPs: 57.10 | 7: iteration 6690/ 44073 | consumed samples: 3425280 | consumed tokens: 7014973440 | elapsed time per iteration (s): 4.15 | learning rate: 1.910E-04 | global batch size: 512 | lm loss: 2.271295E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.338 | TFLOPs: 57.48 | 7: iteration 6700/ 44073 | consumed samples: 3430400 | consumed tokens: 7025459200 | elapsed time per iteration (s): 4.17 | learning rate: 1.910E-04 | global batch size: 512 | lm loss: 2.278869E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.919 | TFLOPs: 57.29 | 7: iteration 6710/ 44073 | consumed samples: 3435520 | consumed tokens: 7035944960 | elapsed time per iteration (s): 4.17 | learning rate: 1.910E-04 | global batch size: 512 | lm loss: 2.307840E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.806 | TFLOPs: 57.23 | 7: iteration 6720/ 44073 | consumed samples: 3440640 | consumed tokens: 7046430720 | elapsed time per iteration (s): 4.14 | learning rate: 1.910E-04 | global batch size: 512 | lm loss: 2.265228E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.613 | TFLOPs: 57.61 | 7: iteration 6730/ 44073 | consumed samples: 3445760 | consumed tokens: 7056916480 | elapsed time per iteration (s): 4.17 | 
learning rate: 1.909E-04 | global batch size: 512 | lm loss: 2.290335E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.879 | TFLOPs: 57.27 | 7: iteration 6740/ 44073 | consumed samples: 3450880 | consumed tokens: 7067402240 | elapsed time per iteration (s): 4.16 | learning rate: 1.909E-04 | global batch size: 512 | lm loss: 2.256063E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.033 | TFLOPs: 57.34 | 7: iteration 6750/ 44073 | consumed samples: 3456000 | consumed tokens: 7077888000 | elapsed time per iteration (s): 4.14 | learning rate: 1.909E-04 | global batch size: 512 | lm loss: 2.252279E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.597 | TFLOPs: 57.60 | 7: iteration 6760/ 44073 | consumed samples: 3461120 | consumed tokens: 7088373760 | elapsed time per iteration (s): 4.15 | learning rate: 1.908E-04 | global batch size: 512 | lm loss: 2.288838E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.278 | TFLOPs: 57.45 | 7: iteration 6770/ 44073 | consumed samples: 3466240 | consumed tokens: 7098859520 | elapsed time per iteration (s): 4.14 | learning rate: 1.908E-04 | global batch size: 512 | lm loss: 2.275353E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 | 7: iteration 6780/ 44073 | consumed samples: 3471360 | consumed tokens: 7109345280 | elapsed time per iteration (s): 4.17 | learning rate: 1.908E-04 | global batch size: 512 | lm loss: 2.269349E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.885 | TFLOPs: 57.27 | 7: iteration 6790/ 44073 | consumed samples: 
3476480 | consumed tokens: 7119831040 | elapsed time per iteration (s): 4.19 | learning rate: 1.908E-04 | global batch size: 512 | lm loss: 2.275544E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.313 | TFLOPs: 57.00 | 7: iteration 6800/ 44073 | consumed samples: 3481600 | consumed tokens: 7130316800 | elapsed time per iteration (s): 4.15 | learning rate: 1.907E-04 | global batch size: 512 | lm loss: 2.284304E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.385 | TFLOPs: 57.50 | 7: iteration 6810/ 44073 | consumed samples: 3486720 | consumed tokens: 7140802560 | elapsed time per iteration (s): 4.16 | learning rate: 1.907E-04 | global batch size: 512 | lm loss: 2.265701E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.109 | TFLOPs: 57.37 | 7: iteration 6820/ 44073 | consumed samples: 3491840 | consumed tokens: 7151288320 | elapsed time per iteration (s): 4.17 | learning rate: 1.907E-04 | global batch size: 512 | lm loss: 2.262713E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.835 | TFLOPs: 57.25 | 7: iteration 6830/ 44073 | consumed samples: 3496960 | consumed tokens: 7161774080 | elapsed time per iteration (s): 4.15 | learning rate: 1.906E-04 | global batch size: 512 | lm loss: 2.276014E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.322 | TFLOPs: 57.47 | 7: iteration 6840/ 44073 | consumed samples: 3502080 | consumed tokens: 7172259840 | elapsed time per iteration (s): 4.17 | learning rate: 1.906E-04 | global batch size: 512 | lm loss: 2.274219E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.654 | TFLOPs: 57.16 | 7: iteration 6850/ 44073 | consumed samples: 3507200 | consumed tokens: 7182745600 | elapsed time per iteration (s): 4.16 | learning rate: 1.906E-04 | global batch size: 512 | lm loss: 2.288600E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.177 | TFLOPs: 57.41 | 7: iteration 6860/ 44073 | consumed samples: 3512320 | consumed tokens: 7193231360 | elapsed time per iteration (s): 4.15 | learning rate: 1.906E-04 | global batch size: 512 | lm loss: 2.255135E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 6870/ 44073 | consumed samples: 3517440 | consumed tokens: 7203717120 | elapsed time per iteration (s): 4.17 | learning rate: 1.905E-04 | global batch size: 512 | lm loss: 2.277048E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.927 | TFLOPs: 57.29 | 7: iteration 6880/ 44073 | consumed samples: 3522560 | consumed tokens: 7214202880 | elapsed time per iteration (s): 4.15 | learning rate: 1.905E-04 | global batch size: 512 | lm loss: 2.259602E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.334 | TFLOPs: 57.48 | 7: iteration 6890/ 44073 | consumed samples: 3527680 | consumed tokens: 7224688640 | elapsed time per iteration (s): 4.14 | learning rate: 1.905E-04 | global batch size: 512 | lm loss: 2.276059E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.602 | TFLOPs: 57.60 | 7: iteration 6900/ 44073 | consumed samples: 3532800 | consumed tokens: 7235174400 | elapsed time per iteration (s): 4.14 | learning rate: 1.904E-04 | global batch size: 512 | lm loss: 2.279462E+00 | grad norm: 0.151 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.584 | TFLOPs: 57.60 | 7: iteration 6910/ 44073 | consumed samples: 3537920 | consumed tokens: 7245660160 | elapsed time per iteration (s): 4.15 | learning rate: 1.904E-04 | global batch size: 512 | lm loss: 2.254719E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.335 | TFLOPs: 57.48 | 7: iteration 6920/ 44073 | consumed samples: 3543040 | consumed tokens: 7256145920 | elapsed time per iteration (s): 4.15 | learning rate: 1.904E-04 | global batch size: 512 | lm loss: 2.267915E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.513 | TFLOPs: 57.56 | 7: iteration 6930/ 44073 | consumed samples: 3548160 | consumed tokens: 7266631680 | elapsed time per iteration (s): 4.17 | learning rate: 1.904E-04 | global batch size: 512 | lm loss: 2.292614E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.832 | TFLOPs: 57.25 | 7: iteration 6940/ 44073 | consumed samples: 3553280 | consumed tokens: 7277117440 | elapsed time per iteration (s): 4.16 | learning rate: 1.903E-04 | global batch size: 512 | lm loss: 2.254272E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.176 | TFLOPs: 57.41 | 7: iteration 6950/ 44073 | consumed samples: 3558400 | consumed tokens: 7287603200 | elapsed time per iteration (s): 4.16 | learning rate: 1.903E-04 | global batch size: 512 | lm loss: 2.251509E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.938 | TFLOPs: 57.30 | 7: iteration 6960/ 44073 | consumed samples: 3563520 | consumed tokens: 7298088960 | elapsed time per iteration (s): 4.15 | learning rate: 1.903E-04 | global 
batch size: 512 | lm loss: 2.264749E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.250 | TFLOPs: 57.44 | 7: iteration 6970/ 44073 | consumed samples: 3568640 | consumed tokens: 7308574720 | elapsed time per iteration (s): 4.16 | learning rate: 1.902E-04 | global batch size: 512 | lm loss: 2.273816E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.052 | TFLOPs: 57.35 | 7: iteration 6980/ 44073 | consumed samples: 3573760 | consumed tokens: 7319060480 | elapsed time per iteration (s): 4.16 | learning rate: 1.902E-04 | global batch size: 512 | lm loss: 2.272180E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.089 | TFLOPs: 57.37 | 7: iteration 6990/ 44073 | consumed samples: 3578880 | consumed tokens: 7329546240 | elapsed time per iteration (s): 4.15 | learning rate: 1.902E-04 | global batch size: 512 | lm loss: 2.258416E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.477 | TFLOPs: 57.55 | 7: iteration 7000/ 44073 | consumed samples: 3584000 | consumed tokens: 7340032000 | elapsed time per iteration (s): 4.14 | learning rate: 1.901E-04 | global batch size: 512 | lm loss: 2.237976E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.735 | TFLOPs: 57.67 | 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 7000 | lm loss value: 2.213866E+00 | lm loss PPL: 9.151030E+00 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 7000 to checkpoints_2b2 0: [2022-11-25 18:08:38,453] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step7000 is begin to save! 0: [2022-11-25 18:08:38,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_01-model_00-model_states.pt... 0: [2022-11-25 18:08:38,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_01-model_00-model_states.pt. 0: [2022-11-25 18:08:38,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_03-model_00-model_states.pt... 0: [2022-11-25 18:08:38,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_03-model_00-model_states.pt. 0: [2022-11-25 18:08:38,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_04-model_00-model_states.pt... 0: [2022-11-25 18:08:39,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_04-model_00-model_states.pt. 0: [2022-11-25 18:08:39,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_05-model_00-model_states.pt... 0: [2022-11-25 18:08:39,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_05-model_00-model_states.pt. 0: [2022-11-25 18:08:39,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_06-model_00-model_states.pt... 0: [2022-11-25 18:08:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_06-model_00-model_states.pt. 0: [2022-11-25 18:08:39,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_07-model_00-model_states.pt... 0: [2022-11-25 18:08:39,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 18:08:39,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_08-model_00-model_states.pt... 0: [2022-11-25 18:08:39,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_08-model_00-model_states.pt. 0: [2022-11-25 18:08:39,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_09-model_00-model_states.pt... 0: [2022-11-25 18:08:39,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_09-model_00-model_states.pt. 0: [2022-11-25 18:08:39,766] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_10-model_00-model_states.pt... 0: [2022-11-25 18:08:39,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_10-model_00-model_states.pt. 0: [2022-11-25 18:08:39,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_11-model_00-model_states.pt... 0: [2022-11-25 18:08:40,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_11-model_00-model_states.pt. 0: [2022-11-25 18:08:40,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_12-model_00-model_states.pt... 0: [2022-11-25 18:08:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_12-model_00-model_states.pt. 0: [2022-11-25 18:08:40,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_13-model_00-model_states.pt... 0: [2022-11-25 18:08:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 18:08:40,288] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_14-model_00-model_states.pt... 0: [2022-11-25 18:08:40,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_14-model_00-model_states.pt. 0: [2022-11-25 18:08:40,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_15-model_00-model_states.pt... 0: [2022-11-25 18:08:40,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_15-model_00-model_states.pt. 0: [2022-11-25 18:08:40,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_16-model_00-model_states.pt... 0: [2022-11-25 18:08:40,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_16-model_00-model_states.pt. 0: [2022-11-25 18:08:40,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_17-model_00-model_states.pt... 0: [2022-11-25 18:08:40,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_17-model_00-model_states.pt. 0: [2022-11-25 18:08:40,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_18-model_00-model_states.pt... 0: [2022-11-25 18:08:40,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_18-model_00-model_states.pt. 0: [2022-11-25 18:08:40,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_19-model_00-model_states.pt... 0: [2022-11-25 18:08:41,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 18:08:41,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_20-model_00-model_states.pt... 0: [2022-11-25 18:08:41,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_20-model_00-model_states.pt. 0: [2022-11-25 18:08:41,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_21-model_00-model_states.pt... 0: [2022-11-25 18:08:41,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_21-model_00-model_states.pt. 0: [2022-11-25 18:08:41,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_22-model_00-model_states.pt... 0: [2022-11-25 18:08:41,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_22-model_00-model_states.pt. 0: [2022-11-25 18:08:41,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_23-model_00-model_states.pt... 0: [2022-11-25 18:08:41,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_23-model_00-model_states.pt. 0: [2022-11-25 18:08:41,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_24-model_00-model_states.pt... 0: [2022-11-25 18:08:41,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_24-model_00-model_states.pt. 0: [2022-11-25 18:08:41,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_25-model_00-model_states.pt... 0: [2022-11-25 18:08:41,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 18:08:41,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_26-model_00-model_states.pt... 0: [2022-11-25 18:08:42,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_26-model_00-model_states.pt. 0: [2022-11-25 18:08:42,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_27-model_00-model_states.pt... 0: [2022-11-25 18:08:42,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_27-model_00-model_states.pt. 0: [2022-11-25 18:08:42,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_28-model_00-model_states.pt... 0: [2022-11-25 18:08:42,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_28-model_00-model_states.pt. 0: [2022-11-25 18:08:42,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_29-model_00-model_states.pt... 0: [2022-11-25 18:08:42,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_29-model_00-model_states.pt. 0: [2022-11-25 18:08:42,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_30-model_00-model_states.pt... 0: [2022-11-25 18:08:42,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_30-model_00-model_states.pt. 0: [2022-11-25 18:08:42,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_31-model_00-model_states.pt... 0: [2022-11-25 18:08:42,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_31-model_00-model_states.pt. 
0: [2022-11-25 18:08:42,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_32-model_00-model_states.pt... 0: [2022-11-25 18:08:42,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_32-model_00-model_states.pt. 0: [2022-11-25 18:08:42,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_33-model_00-model_states.pt... 0: [2022-11-25 18:08:43,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_33-model_00-model_states.pt. 0: [2022-11-25 18:08:43,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_34-model_00-model_states.pt... 0: [2022-11-25 18:08:43,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_34-model_00-model_states.pt. 0: [2022-11-25 18:08:43,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/layer_36-model_00-model_states.pt... 0: [2022-11-25 18:08:43,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/layer_36-model_00-model_states.pt. 0: [2022-11-25 18:08:43,169] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step7000/mp_rank_00_model_states.pt 0: [2022-11-25 18:08:43,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/mp_rank_00_model_states.pt... 0: [2022-11-25 18:08:43,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/mp_rank_00_model_states.pt. 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:08:43,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:08:43,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:08:43,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:08:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 18:08:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:08:43,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:08:43,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 18:08:43,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:08:43,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 18:08:43,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:08:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 18:08:43,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:08:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 18:08:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
7: [2022-11-25 18:08:44,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:08:44,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 18:08:44,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:08:44,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:08:44,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:08:44,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 18:08:44,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 18:08:44,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:08:44,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:08:44,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 18:08:44,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
5: [2022-11-25 18:08:44,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 18:08:44,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:08:44,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 18:08:44,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
0: [2022-11-25 18:08:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 18:08:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: successfully saved checkpoint at iteration 7000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6094.47 7: iteration 7010/ 44073 | consumed samples: 3589120 | consumed tokens: 7350517760 | elapsed time per iteration (s): 4.91 | learning rate: 1.901E-04 | global batch size: 512 | lm loss: 2.286072E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.303 | TFLOPs: 48.61 | 7: iteration 7020/ 44073 | consumed samples: 3594240 | consumed tokens: 7361003520 | elapsed time per iteration (s): 4.15 | learning rate: 1.901E-04 | global batch size: 512 | lm loss: 2.266313E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.468 | TFLOPs: 57.54 | 7: iteration 7030/ 44073 | consumed samples: 3599360 | consumed tokens: 7371489280 | elapsed time per iteration (s): 4.14 | learning rate: 1.901E-04 | global batch size: 512 | lm loss: 2.243499E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.685 | TFLOPs: 57.64 | 7: iteration 7040/ 44073 | consumed samples: 3604480 | consumed tokens: 7381975040 | elapsed time per iteration (s): 4.16 | learning rate: 1.900E-04 | global batch size: 512 | lm loss: 2.250320E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.089 | TFLOPs: 57.37 | 7: iteration 7050/ 44073 | consumed samples: 3609600 | consumed tokens: 7392460800 | elapsed time per iteration (s): 4.19 | learning rate: 1.900E-04 | global batch size: 512 | lm loss: 2.250921E+00 | 
grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.159 | TFLOPs: 56.93 | 7: iteration 7060/ 44073 | consumed samples: 3614720 | consumed tokens: 7402946560 | elapsed time per iteration (s): 4.18 | learning rate: 1.900E-04 | global batch size: 512 | lm loss: 2.267338E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.595 | TFLOPs: 57.14 | 7: iteration 7070/ 44073 | consumed samples: 3619840 | consumed tokens: 7413432320 | elapsed time per iteration (s): 4.17 | learning rate: 1.899E-04 | global batch size: 512 | lm loss: 2.262760E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.911 | TFLOPs: 57.28 | 7: iteration 7080/ 44073 | consumed samples: 3624960 | consumed tokens: 7423918080 | elapsed time per iteration (s): 4.17 | learning rate: 1.899E-04 | global batch size: 512 | lm loss: 2.285581E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.887 | TFLOPs: 57.27 | 7: iteration 7090/ 44073 | consumed samples: 3630080 | consumed tokens: 7434403840 | elapsed time per iteration (s): 4.16 | learning rate: 1.899E-04 | global batch size: 512 | lm loss: 2.274239E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.082 | TFLOPs: 57.36 | 7: iteration 7100/ 44073 | consumed samples: 3635200 | consumed tokens: 7444889600 | elapsed time per iteration (s): 4.17 | learning rate: 1.899E-04 | global batch size: 512 | lm loss: 2.260021E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.810 | TFLOPs: 57.24 | 7: iteration 7110/ 44073 | consumed samples: 3640320 | consumed tokens: 7455375360 | elapsed time per iteration (s): 4.16 | 
learning rate: 1.898E-04 | global batch size: 512 | lm loss: 2.254401E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.189 | TFLOPs: 57.41 | 7: iteration 7120/ 44073 | consumed samples: 3645440 | consumed tokens: 7465861120 | elapsed time per iteration (s): 4.19 | learning rate: 1.898E-04 | global batch size: 512 | lm loss: 2.261362E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.306 | TFLOPs: 57.00 | 7: iteration 7130/ 44073 | consumed samples: 3650560 | consumed tokens: 7476346880 | elapsed time per iteration (s): 4.16 | learning rate: 1.898E-04 | global batch size: 512 | lm loss: 2.274652E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.097 | TFLOPs: 57.37 | 7: iteration 7140/ 44073 | consumed samples: 3655680 | consumed tokens: 7486832640 | elapsed time per iteration (s): 4.16 | learning rate: 1.897E-04 | global batch size: 512 | lm loss: 2.245674E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.063 | TFLOPs: 57.35 | 7: iteration 7150/ 44073 | consumed samples: 3660800 | consumed tokens: 7497318400 | elapsed time per iteration (s): 4.14 | learning rate: 1.897E-04 | global batch size: 512 | lm loss: 2.242257E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.748 | TFLOPs: 57.67 | 7: iteration 7160/ 44073 | consumed samples: 3665920 | consumed tokens: 7507804160 | elapsed time per iteration (s): 4.16 | learning rate: 1.897E-04 | global batch size: 512 | lm loss: 2.221474E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.156 | TFLOPs: 57.40 | 7: iteration 7170/ 44073 | consumed samples: 
3671040 | consumed tokens: 7518289920 | elapsed time per iteration (s): 4.14 | learning rate: 1.896E-04 | global batch size: 512 | lm loss: 2.247528E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.715 | TFLOPs: 57.66 | 7: iteration 7180/ 44073 | consumed samples: 3676160 | consumed tokens: 7528775680 | elapsed time per iteration (s): 4.14 | learning rate: 1.896E-04 | global batch size: 512 | lm loss: 2.282280E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.581 | TFLOPs: 57.59 | 7: iteration 7190/ 44073 | consumed samples: 3681280 | consumed tokens: 7539261440 | elapsed time per iteration (s): 4.17 | learning rate: 1.896E-04 | global batch size: 512 | lm loss: 2.262147E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.895 | TFLOPs: 57.28 | 7: iteration 7200/ 44073 | consumed samples: 3686400 | consumed tokens: 7549747200 | elapsed time per iteration (s): 4.14 | learning rate: 1.896E-04 | global batch size: 512 | lm loss: 2.241993E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.599 | TFLOPs: 57.60 | 7: iteration 7210/ 44073 | consumed samples: 3691520 | consumed tokens: 7560232960 | elapsed time per iteration (s): 4.15 | learning rate: 1.895E-04 | global batch size: 512 | lm loss: 2.232441E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.496 | TFLOPs: 57.56 | 7: iteration 7220/ 44073 | consumed samples: 3696640 | consumed tokens: 7570718720 | elapsed time per iteration (s): 4.18 | learning rate: 1.895E-04 | global batch size: 512 | lm loss: 2.239600E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.367 | TFLOPs: 57.03 | 7: iteration 7230/ 44073 | consumed samples: 3701760 | consumed tokens: 7581204480 | elapsed time per iteration (s): 4.15 | learning rate: 1.895E-04 | global batch size: 512 | lm loss: 2.252636E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 | 7: iteration 7240/ 44073 | consumed samples: 3706880 | consumed tokens: 7591690240 | elapsed time per iteration (s): 4.15 | learning rate: 1.894E-04 | global batch size: 512 | lm loss: 2.252398E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.414 | TFLOPs: 57.52 | 7: iteration 7250/ 44073 | consumed samples: 3712000 | consumed tokens: 7602176000 | elapsed time per iteration (s): 4.14 | learning rate: 1.894E-04 | global batch size: 512 | lm loss: 2.241480E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.730 | TFLOPs: 57.66 | 7: iteration 7260/ 44073 | consumed samples: 3717120 | consumed tokens: 7612661760 | elapsed time per iteration (s): 4.15 | learning rate: 1.894E-04 | global batch size: 512 | lm loss: 2.284637E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.433 | TFLOPs: 57.53 | 7: iteration 7270/ 44073 | consumed samples: 3722240 | consumed tokens: 7623147520 | elapsed time per iteration (s): 4.15 | learning rate: 1.893E-04 | global batch size: 512 | lm loss: 2.253250E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 | 7: iteration 7280/ 44073 | consumed samples: 3727360 | consumed tokens: 7633633280 | elapsed time per iteration (s): 4.14 | learning rate: 1.893E-04 | global batch size: 512 | lm loss: 2.233885E+00 | grad norm: 0.141 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.559 | TFLOPs: 57.58 | 7: iteration 7290/ 44073 | consumed samples: 3732480 | consumed tokens: 7644119040 | elapsed time per iteration (s): 4.16 | learning rate: 1.893E-04 | global batch size: 512 | lm loss: 2.233386E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.093 | TFLOPs: 57.37 | 7: iteration 7300/ 44073 | consumed samples: 3737600 | consumed tokens: 7654604800 | elapsed time per iteration (s): 4.16 | learning rate: 1.892E-04 | global batch size: 512 | lm loss: 2.266964E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.130 | TFLOPs: 57.38 | 7: iteration 7310/ 44073 | consumed samples: 3742720 | consumed tokens: 7665090560 | elapsed time per iteration (s): 4.19 | learning rate: 1.892E-04 | global batch size: 512 | lm loss: 2.258572E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.137 | TFLOPs: 56.92 | 7: iteration 7320/ 44073 | consumed samples: 3747840 | consumed tokens: 7675576320 | elapsed time per iteration (s): 4.16 | learning rate: 1.892E-04 | global batch size: 512 | lm loss: 2.229806E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.168 | TFLOPs: 57.40 | 7: iteration 7330/ 44073 | consumed samples: 3752960 | consumed tokens: 7686062080 | elapsed time per iteration (s): 4.16 | learning rate: 1.892E-04 | global batch size: 512 | lm loss: 2.240973E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.995 | TFLOPs: 57.32 | 7: iteration 7340/ 44073 | consumed samples: 3758080 | consumed tokens: 7696547840 | elapsed time per iteration (s): 4.15 | learning rate: 1.891E-04 | global 
batch size: 512 | lm loss: 2.251693E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.516 | TFLOPs: 57.56 | 7: iteration 7350/ 44073 | consumed samples: 3763200 | consumed tokens: 7707033600 | elapsed time per iteration (s): 4.16 | learning rate: 1.891E-04 | global batch size: 512 | lm loss: 2.251552E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.108 | TFLOPs: 57.37 | 7: iteration 7360/ 44073 | consumed samples: 3768320 | consumed tokens: 7717519360 | elapsed time per iteration (s): 4.17 | learning rate: 1.891E-04 | global batch size: 512 | lm loss: 2.247317E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.848 | TFLOPs: 57.25 | 7: iteration 7370/ 44073 | consumed samples: 3773440 | consumed tokens: 7728005120 | elapsed time per iteration (s): 4.34 | learning rate: 1.890E-04 | global batch size: 512 | lm loss: 2.254291E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.091 | TFLOPs: 55.04 | 7: iteration 7380/ 44073 | consumed samples: 3778560 | consumed tokens: 7738490880 | elapsed time per iteration (s): 4.23 | learning rate: 1.890E-04 | global batch size: 512 | lm loss: 2.244160E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.102 | TFLOPs: 56.44 | 7: iteration 7390/ 44073 | consumed samples: 3783680 | consumed tokens: 7748976640 | elapsed time per iteration (s): 4.14 | learning rate: 1.890E-04 | global batch size: 512 | lm loss: 2.258108E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.673 | TFLOPs: 57.64 | 7: iteration 7400/ 44073 | consumed samples: 3788800 | consumed tokens: 
7759462400 | elapsed time per iteration (s): 4.15 | learning rate: 1.889E-04 | global batch size: 512 | lm loss: 2.251628E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.285 | TFLOPs: 57.46 | 7: iteration 7410/ 44073 | consumed samples: 3793920 | consumed tokens: 7769948160 | elapsed time per iteration (s): 4.14 | learning rate: 1.889E-04 | global batch size: 512 | lm loss: 2.252347E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.529 | TFLOPs: 57.57 | 7: iteration 7420/ 44073 | consumed samples: 3799040 | consumed tokens: 7780433920 | elapsed time per iteration (s): 4.16 | learning rate: 1.889E-04 | global batch size: 512 | lm loss: 2.245482E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.200 | TFLOPs: 57.42 | 7: iteration 7430/ 44073 | consumed samples: 3804160 | consumed tokens: 7790919680 | elapsed time per iteration (s): 4.24 | learning rate: 1.888E-04 | global batch size: 512 | lm loss: 2.230046E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.672 | TFLOPs: 56.24 | 7: iteration 7440/ 44073 | consumed samples: 3809280 | consumed tokens: 7801405440 | elapsed time per iteration (s): 4.16 | learning rate: 1.888E-04 | global batch size: 512 | lm loss: 2.251983E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.123 | TFLOPs: 57.38 | 7: iteration 7450/ 44073 | consumed samples: 3814400 | consumed tokens: 7811891200 | elapsed time per iteration (s): 4.14 | learning rate: 1.888E-04 | global batch size: 512 | lm loss: 2.221517E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.584 | TFLOPs: 
57.60 | 7: iteration 7460/ 44073 | consumed samples: 3819520 | consumed tokens: 7822376960 | elapsed time per iteration (s): 4.16 | learning rate: 1.887E-04 | global batch size: 512 | lm loss: 2.240539E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.089 | TFLOPs: 57.37 | 7: iteration 7470/ 44073 | consumed samples: 3824640 | consumed tokens: 7832862720 | elapsed time per iteration (s): 4.16 | learning rate: 1.887E-04 | global batch size: 512 | lm loss: 2.237818E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.110 | TFLOPs: 57.38 | 7: iteration 7480/ 44073 | consumed samples: 3829760 | consumed tokens: 7843348480 | elapsed time per iteration (s): 4.15 | learning rate: 1.887E-04 | global batch size: 512 | lm loss: 2.209321E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.366 | TFLOPs: 57.49 | 7: iteration 7490/ 44073 | consumed samples: 3834880 | consumed tokens: 7853834240 | elapsed time per iteration (s): 4.14 | learning rate: 1.887E-04 | global batch size: 512 | lm loss: 2.239638E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.716 | TFLOPs: 57.66 | 7: iteration 7500/ 44073 | consumed samples: 3840000 | consumed tokens: 7864320000 | elapsed time per iteration (s): 4.19 | learning rate: 1.886E-04 | global batch size: 512 | lm loss: 2.478934E+00 | grad norm: 3.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.210 | TFLOPs: 56.96 | 7: iteration 7510/ 44073 | consumed samples: 3845120 | consumed tokens: 7874805760 | elapsed time per iteration (s): 4.22 | learning rate: 1.886E-04 | global batch size: 512 | lm loss: 3.416974E+00 | grad norm: 5.259 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.455 | TFLOPs: 56.60 | 7: iteration 7520/ 44073 | consumed samples: 3850240 | consumed tokens: 7885291520 | elapsed time per iteration (s): 4.31 | learning rate: 1.886E-04 | global batch size: 512 | lm loss: 2.895564E+00 | grad norm: 1.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.879 | TFLOPs: 55.40 | 7: iteration 7530/ 44073 | consumed samples: 3855360 | consumed tokens: 7895777280 | elapsed time per iteration (s): 4.15 | learning rate: 1.885E-04 | global batch size: 512 | lm loss: 2.649863E+00 | grad norm: 1.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.482 | TFLOPs: 57.55 | 7: iteration 7540/ 44073 | consumed samples: 3860480 | consumed tokens: 7906263040 | elapsed time per iteration (s): 4.17 | learning rate: 1.885E-04 | global batch size: 512 | lm loss: 2.459320E+00 | grad norm: 0.444 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.659 | TFLOPs: 57.17 | 7: iteration 7550/ 44073 | consumed samples: 3865600 | consumed tokens: 7916748800 | elapsed time per iteration (s): 4.17 | learning rate: 1.885E-04 | global batch size: 512 | lm loss: 2.360830E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.671 | TFLOPs: 57.17 | 7: iteration 7560/ 44073 | consumed samples: 3870720 | consumed tokens: 7927234560 | elapsed time per iteration (s): 4.17 | learning rate: 1.884E-04 | global batch size: 512 | lm loss: 2.325906E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.773 | TFLOPs: 57.22 | 7: iteration 7570/ 44073 | consumed samples: 3875840 | consumed tokens: 7937720320 | elapsed time per iteration (s): 4.17 | learning rate: 1.884E-04 | global batch size: 512 | 
lm loss: 2.300577E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.740 | TFLOPs: 57.20 | 7: iteration 7580/ 44073 | consumed samples: 3880960 | consumed tokens: 7948206080 | elapsed time per iteration (s): 4.15 | learning rate: 1.884E-04 | global batch size: 512 | lm loss: 2.272978E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.476 | TFLOPs: 57.55 | 7: iteration 7590/ 44073 | consumed samples: 3886080 | consumed tokens: 7958691840 | elapsed time per iteration (s): 4.17 | learning rate: 1.883E-04 | global batch size: 512 | lm loss: 2.279922E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.804 | TFLOPs: 57.23 | 7: iteration 7600/ 44073 | consumed samples: 3891200 | consumed tokens: 7969177600 | elapsed time per iteration (s): 4.14 | learning rate: 1.883E-04 | global batch size: 512 | lm loss: 2.276679E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.617 | TFLOPs: 57.61 | 7: iteration 7610/ 44073 | consumed samples: 3896320 | consumed tokens: 7979663360 | elapsed time per iteration (s): 4.15 | learning rate: 1.883E-04 | global batch size: 512 | lm loss: 2.273255E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 7620/ 44073 | consumed samples: 3901440 | consumed tokens: 7990149120 | elapsed time per iteration (s): 4.23 | learning rate: 1.882E-04 | global batch size: 512 | lm loss: 2.267726E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.028 | TFLOPs: 56.41 | 7: iteration 7630/ 44073 | consumed samples: 3906560 | consumed tokens: 8000634880 | elapsed time 
per iteration (s): 4.15 | learning rate: 1.882E-04 | global batch size: 512 | lm loss: 2.265240E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 7640/ 44073 | consumed samples: 3911680 | consumed tokens: 8011120640 | elapsed time per iteration (s): 4.17 | learning rate: 1.882E-04 | global batch size: 512 | lm loss: 2.261798E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.642 | TFLOPs: 57.16 | 7: iteration 7650/ 44073 | consumed samples: 3916800 | consumed tokens: 8021606400 | elapsed time per iteration (s): 4.16 | learning rate: 1.881E-04 | global batch size: 512 | lm loss: 2.246387E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.958 | TFLOPs: 57.30 | 7: iteration 7660/ 44073 | consumed samples: 3921920 | consumed tokens: 8032092160 | elapsed time per iteration (s): 4.17 | learning rate: 1.881E-04 | global batch size: 512 | lm loss: 2.257887E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.679 | TFLOPs: 57.17 | 7: iteration 7670/ 44073 | consumed samples: 3927040 | consumed tokens: 8042577920 | elapsed time per iteration (s): 4.16 | learning rate: 1.881E-04 | global batch size: 512 | lm loss: 2.252812E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.984 | TFLOPs: 57.32 | 7: iteration 7680/ 44073 | consumed samples: 3932160 | consumed tokens: 8053063680 | elapsed time per iteration (s): 4.19 | learning rate: 1.880E-04 | global batch size: 512 | lm loss: 2.253126E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.145 | TFLOPs: 56.93 | 7: iteration 7690/ 
44073 | consumed samples: 3937280 | consumed tokens: 8063549440 | elapsed time per iteration (s): 4.18 | learning rate: 1.880E-04 | global batch size: 512 | lm loss: 2.248038E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.634 | TFLOPs: 57.15 | 7: iteration 7700/ 44073 | consumed samples: 3942400 | consumed tokens: 8074035200 | elapsed time per iteration (s): 4.19 | learning rate: 1.880E-04 | global batch size: 512 | lm loss: 2.254634E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.085 | TFLOPs: 56.90 | 7: iteration 7710/ 44073 | consumed samples: 3947520 | consumed tokens: 8084520960 | elapsed time per iteration (s): 4.21 | learning rate: 1.880E-04 | global batch size: 512 | lm loss: 2.276679E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.675 | TFLOPs: 56.71 | 7: iteration 7720/ 44073 | consumed samples: 3952640 | consumed tokens: 8095006720 | elapsed time per iteration (s): 4.15 | learning rate: 1.879E-04 | global batch size: 512 | lm loss: 2.247698E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.402 | TFLOPs: 57.51 | 7: iteration 7730/ 44073 | consumed samples: 3957760 | consumed tokens: 8105492480 | elapsed time per iteration (s): 4.17 | learning rate: 1.879E-04 | global batch size: 512 | lm loss: 2.257274E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.742 | TFLOPs: 57.20 | 7: iteration 7740/ 44073 | consumed samples: 3962880 | consumed tokens: 8115978240 | elapsed time per iteration (s): 4.14 | learning rate: 1.879E-04 | global batch size: 512 | lm loss: 2.255594E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.784 | TFLOPs: 57.69 | 7: iteration 7750/ 44073 | consumed samples: 3968000 | consumed tokens: 8126464000 | elapsed time per iteration (s): 4.17 | learning rate: 1.878E-04 | global batch size: 512 | lm loss: 2.247036E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.763 | TFLOPs: 57.21 | 7: iteration 7760/ 44073 | consumed samples: 3973120 | consumed tokens: 8136949760 | elapsed time per iteration (s): 4.14 | learning rate: 1.878E-04 | global batch size: 512 | lm loss: 2.253462E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.556 | TFLOPs: 57.58 | 7: iteration 7770/ 44073 | consumed samples: 3978240 | consumed tokens: 8147435520 | elapsed time per iteration (s): 4.20 | learning rate: 1.878E-04 | global batch size: 512 | lm loss: 2.264364E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.024 | TFLOPs: 56.87 | 7: iteration 7780/ 44073 | consumed samples: 3983360 | consumed tokens: 8157921280 | elapsed time per iteration (s): 4.15 | learning rate: 1.877E-04 | global batch size: 512 | lm loss: 2.237298E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.481 | TFLOPs: 57.55 | 7: iteration 7790/ 44073 | consumed samples: 3988480 | consumed tokens: 8168407040 | elapsed time per iteration (s): 4.20 | learning rate: 1.877E-04 | global batch size: 512 | lm loss: 2.239933E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.972 | TFLOPs: 56.85 | 7: iteration 7800/ 44073 | consumed samples: 3993600 | consumed tokens: 8178892800 | elapsed time per iteration (s): 4.15 | learning rate: 1.877E-04 | global batch size: 512 | lm loss: 2.257153E+00 | grad 
norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.283 | TFLOPs: 57.46 | 7: iteration 7810/ 44073 | consumed samples: 3998720 | consumed tokens: 8189378560 | elapsed time per iteration (s): 4.18 | learning rate: 1.876E-04 | global batch size: 512 | lm loss: 2.241595E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.583 | TFLOPs: 57.13 | 7: iteration 7820/ 44073 | consumed samples: 4003840 | consumed tokens: 8199864320 | elapsed time per iteration (s): 4.15 | learning rate: 1.876E-04 | global batch size: 512 | lm loss: 2.252339E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.240 | TFLOPs: 57.44 | 7: iteration 7830/ 44073 | consumed samples: 4008960 | consumed tokens: 8210350080 | elapsed time per iteration (s): 4.17 | learning rate: 1.876E-04 | global batch size: 512 | lm loss: 2.247963E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.838 | TFLOPs: 57.25 | 7: iteration 7840/ 44073 | consumed samples: 4014080 | consumed tokens: 8220835840 | elapsed time per iteration (s): 4.16 | learning rate: 1.875E-04 | global batch size: 512 | lm loss: 2.234962E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.948 | TFLOPs: 57.30 | 7: iteration 7850/ 44073 | consumed samples: 4019200 | consumed tokens: 8231321600 | elapsed time per iteration (s): 4.15 | learning rate: 1.875E-04 | global batch size: 512 | lm loss: 2.255456E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.250 | TFLOPs: 57.44 | 7: iteration 7860/ 44073 | consumed samples: 4024320 | consumed tokens: 8241807360 | elapsed time per iteration (s): 4.15 | 
learning rate: 1.875E-04 | global batch size: 512 | lm loss: 2.258986E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.368 | TFLOPs: 57.50 | 7: iteration 7870/ 44073 | consumed samples: 4029440 | consumed tokens: 8252293120 | elapsed time per iteration (s): 4.14 | learning rate: 1.874E-04 | global batch size: 512 | lm loss: 2.250183E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.622 | TFLOPs: 57.61 | 7: iteration 7880/ 44073 | consumed samples: 4034560 | consumed tokens: 8262778880 | elapsed time per iteration (s): 4.14 | learning rate: 1.874E-04 | global batch size: 512 | lm loss: 2.244798E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.597 | TFLOPs: 57.60 | 7: iteration 7890/ 44073 | consumed samples: 4039680 | consumed tokens: 8273264640 | elapsed time per iteration (s): 4.15 | learning rate: 1.874E-04 | global batch size: 512 | lm loss: 2.236316E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.418 | TFLOPs: 57.52 | 7: iteration 7900/ 44073 | consumed samples: 4044800 | consumed tokens: 8283750400 | elapsed time per iteration (s): 4.16 | learning rate: 1.873E-04 | global batch size: 512 | lm loss: 2.221682E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.024 | TFLOPs: 57.34 | 7: iteration 7910/ 44073 | consumed samples: 4049920 | consumed tokens: 8294236160 | elapsed time per iteration (s): 4.16 | learning rate: 1.873E-04 | global batch size: 512 | lm loss: 2.232364E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.216 | TFLOPs: 57.42 | 7: iteration 7920/ 44073 | consumed samples: 
4055040 | consumed tokens: 8304721920 | elapsed time per iteration (s): 4.16 | learning rate: 1.873E-04 | global batch size: 512 | lm loss: 2.246525E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.118 | TFLOPs: 57.38 | 7: iteration 7930/ 44073 | consumed samples: 4060160 | consumed tokens: 8315207680 | elapsed time per iteration (s): 4.14 | learning rate: 1.872E-04 | global batch size: 512 | lm loss: 2.243117E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.553 | TFLOPs: 57.58 | 7: iteration 7940/ 44073 | consumed samples: 4065280 | consumed tokens: 8325693440 | elapsed time per iteration (s): 4.17 | learning rate: 1.872E-04 | global batch size: 512 | lm loss: 2.223857E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.691 | TFLOPs: 57.18 | 7: iteration 7950/ 44073 | consumed samples: 4070400 | consumed tokens: 8336179200 | elapsed time per iteration (s): 4.19 | learning rate: 1.872E-04 | global batch size: 512 | lm loss: 2.241352E+00 | grad norm: 0.332 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.149 | TFLOPs: 56.93 | 7: iteration 7960/ 44073 | consumed samples: 4075520 | consumed tokens: 8346664960 | elapsed time per iteration (s): 4.17 | learning rate: 1.871E-04 | global batch size: 512 | lm loss: 2.225301E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.845 | TFLOPs: 57.25 | 7: iteration 7970/ 44073 | consumed samples: 4080640 | consumed tokens: 8357150720 | elapsed time per iteration (s): 4.20 | learning rate: 1.871E-04 | global batch size: 512 | lm loss: 2.232019E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 121.864 | TFLOPs: 56.79 | 7: iteration 7980/ 44073 | consumed samples: 4085760 | consumed tokens: 8367636480 | elapsed time per iteration (s): 4.14 | learning rate: 1.871E-04 | global batch size: 512 | lm loss: 2.234123E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.723 | TFLOPs: 57.66 | 7: iteration 7990/ 44073 | consumed samples: 4090880 | consumed tokens: 8378122240 | elapsed time per iteration (s): 4.17 | learning rate: 1.870E-04 | global batch size: 512 | lm loss: 2.248563E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.784 | TFLOPs: 57.22 | 0: [2022-11-25 19:18:11,032] [INFO] [logging.py:68:log_dist] [Rank 0] step=8000, skipped=0, lr=[0.0001869954336874501, 0.0001869954336874501, 0.0001869954336874501], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 8000/ 44073 | consumed samples: 4096000 | consumed tokens: 8388608000 | elapsed time per iteration (s): 4.15 | learning rate: 1.870E-04 | global batch size: 512 | lm loss: 2.241989E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.456 | TFLOPs: 57.54 | 0: steps: 8000 loss: 2.2879 iter time (s): 4.162 samples/sec: 123.032 7: ------------------------------------------------------------------------------------------ 7: valid loss at iteration 8000 | lm loss value: 2.184408E+00 | lm loss PPL: 8.885386E+00 | 7: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 8000 to checkpoints_2b2 0: [2022-11-25 19:18:12,371] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step8000 is begin to save! 0: [2022-11-25 19:18:12,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_01-model_00-model_states.pt... 
0: [2022-11-25 19:18:12,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_01-model_00-model_states.pt. 0: [2022-11-25 19:18:12,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_03-model_00-model_states.pt... 0: [2022-11-25 19:18:12,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_03-model_00-model_states.pt. 0: [2022-11-25 19:18:12,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_04-model_00-model_states.pt... 0: [2022-11-25 19:18:13,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_04-model_00-model_states.pt. 0: [2022-11-25 19:18:13,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_05-model_00-model_states.pt... 0: [2022-11-25 19:18:13,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_05-model_00-model_states.pt. 0: [2022-11-25 19:18:13,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_06-model_00-model_states.pt... 0: [2022-11-25 19:18:13,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_06-model_00-model_states.pt. 0: [2022-11-25 19:18:13,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_07-model_00-model_states.pt... 0: [2022-11-25 19:18:13,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_07-model_00-model_states.pt. 0: [2022-11-25 19:18:13,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 19:18:13,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_08-model_00-model_states.pt. 0: [2022-11-25 19:18:13,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_09-model_00-model_states.pt... 0: [2022-11-25 19:18:13,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_09-model_00-model_states.pt. 0: [2022-11-25 19:18:13,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_10-model_00-model_states.pt... 0: [2022-11-25 19:18:13,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_10-model_00-model_states.pt. 0: [2022-11-25 19:18:13,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_11-model_00-model_states.pt... 0: [2022-11-25 19:18:13,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_11-model_00-model_states.pt. 0: [2022-11-25 19:18:13,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_12-model_00-model_states.pt... 0: [2022-11-25 19:18:14,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_12-model_00-model_states.pt. 0: [2022-11-25 19:18:14,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_13-model_00-model_states.pt... 0: [2022-11-25 19:18:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_13-model_00-model_states.pt. 0: [2022-11-25 19:18:14,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 19:18:14,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_14-model_00-model_states.pt. 0: [2022-11-25 19:18:14,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_15-model_00-model_states.pt... 0: [2022-11-25 19:18:14,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_15-model_00-model_states.pt. 0: [2022-11-25 19:18:14,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_16-model_00-model_states.pt... 0: [2022-11-25 19:18:14,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_16-model_00-model_states.pt. 0: [2022-11-25 19:18:14,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_17-model_00-model_states.pt... 0: [2022-11-25 19:18:14,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_17-model_00-model_states.pt. 0: [2022-11-25 19:18:14,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_18-model_00-model_states.pt... 0: [2022-11-25 19:18:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_18-model_00-model_states.pt. 0: [2022-11-25 19:18:14,926] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_19-model_00-model_states.pt... 0: [2022-11-25 19:18:15,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_19-model_00-model_states.pt. 0: [2022-11-25 19:18:15,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 19:18:15,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_20-model_00-model_states.pt. 0: [2022-11-25 19:18:15,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_21-model_00-model_states.pt... 0: [2022-11-25 19:18:15,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_21-model_00-model_states.pt. 0: [2022-11-25 19:18:15,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_22-model_00-model_states.pt... 0: [2022-11-25 19:18:15,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_22-model_00-model_states.pt. 0: [2022-11-25 19:18:15,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_23-model_00-model_states.pt... 0: [2022-11-25 19:18:15,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_23-model_00-model_states.pt. 0: [2022-11-25 19:18:15,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_24-model_00-model_states.pt... 0: [2022-11-25 19:18:15,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_24-model_00-model_states.pt. 0: [2022-11-25 19:18:15,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_25-model_00-model_states.pt... 0: [2022-11-25 19:18:15,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_25-model_00-model_states.pt. 0: [2022-11-25 19:18:15,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 19:18:16,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_26-model_00-model_states.pt. 0: [2022-11-25 19:18:16,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_27-model_00-model_states.pt... 0: [2022-11-25 19:18:16,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_27-model_00-model_states.pt. 0: [2022-11-25 19:18:16,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_28-model_00-model_states.pt... 0: [2022-11-25 19:18:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_28-model_00-model_states.pt. 0: [2022-11-25 19:18:16,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_29-model_00-model_states.pt... 0: [2022-11-25 19:18:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_29-model_00-model_states.pt. 0: [2022-11-25 19:18:16,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_30-model_00-model_states.pt... 0: [2022-11-25 19:18:16,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_30-model_00-model_states.pt. 0: [2022-11-25 19:18:16,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_31-model_00-model_states.pt... 0: [2022-11-25 19:18:16,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_31-model_00-model_states.pt. 0: [2022-11-25 19:18:16,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_32-model_00-model_states.pt... 
0: [2022-11-25 19:18:16,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_32-model_00-model_states.pt. 0: [2022-11-25 19:18:16,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_33-model_00-model_states.pt... 0: [2022-11-25 19:18:16,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_33-model_00-model_states.pt. 0: [2022-11-25 19:18:16,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_34-model_00-model_states.pt... 0: [2022-11-25 19:18:17,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_34-model_00-model_states.pt. 0: [2022-11-25 19:18:17,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/layer_36-model_00-model_states.pt... 0: [2022-11-25 19:18:17,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/layer_36-model_00-model_states.pt. 0: [2022-11-25 19:18:17,140] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step8000/mp_rank_00_model_states.pt 0: [2022-11-25 19:18:17,140] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/mp_rank_00_model_states.pt... 0: [2022-11-25 19:18:17,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/mp_rank_00_model_states.pt. 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:17,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:17,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:17,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:18:17,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 19:18:17,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:17,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
6: [2022-11-25 19:18:17,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:18:17,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:17,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:17,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:18:17,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:17,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-25 19:18:17,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:17,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 19:18:17,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 19:18:17,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:17,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:17,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:17,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:18:17,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:17,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:17,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:17,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:17,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
4: [2022-11-25 19:18:17,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:17,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:17,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:17,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:17,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:17,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:17,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
2: [2022-11-25 19:18:17,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:17,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:17,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:17,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:17,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:17,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:18,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:18,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:18,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:18,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:18,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
0: successfully saved checkpoint at iteration 8000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6012.71 7: iteration 8010/ 44073 | consumed samples: 4101120 | consumed tokens: 8399093760 | elapsed time per iteration (s): 4.91 | learning rate: 1.870E-04 | global batch size: 512 | lm loss: 2.230784E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.197 | TFLOPs: 48.56 | 7: iteration 8020/ 44073 | consumed samples: 4106240 | consumed tokens: 8409579520 | elapsed time per iteration (s): 4.16 | learning rate: 1.869E-04 | global batch size: 512 | lm loss: 2.241724E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.126 | TFLOPs: 57.38 | 7: iteration 8030/ 44073 | consumed samples: 4111360 | consumed tokens: 8420065280 | elapsed time per iteration (s): 4.19 | learning rate: 1.869E-04 | global batch size: 512 | lm loss: 2.234612E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.297 | TFLOPs: 57.00 | 7: iteration 8040/ 44073 | consumed samples: 4116480 | consumed tokens: 8430551040 | elapsed time per iteration (s): 4.14 | learning rate: 1.869E-04 | global batch size: 512 | lm loss: 2.230355E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.578 | TFLOPs: 57.59 | 7: iteration 8050/ 44073 | consumed samples: 4121600 | consumed tokens: 8441036800 | elapsed time per iteration (s): 4.14 | learning rate: 1.868E-04 | global batch size: 512 | lm loss: 2.221357E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.629 | TFLOPs: 57.62 | 7: iteration 8060/ 44073 | consumed samples: 4126720 | consumed tokens: 8451522560 | elapsed time per iteration (s): 4.18 | learning rate: 1.868E-04 | global 
batch size: 512 | lm loss: 2.248316E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.509 | TFLOPs: 57.10 | 7: iteration 8070/ 44073 | consumed samples: 4131840 | consumed tokens: 8462008320 | elapsed time per iteration (s): 4.20 | learning rate: 1.868E-04 | global batch size: 512 | lm loss: 2.212086E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.934 | TFLOPs: 56.83 | 7: iteration 8080/ 44073 | consumed samples: 4136960 | consumed tokens: 8472494080 | elapsed time per iteration (s): 4.17 | learning rate: 1.867E-04 | global batch size: 512 | lm loss: 2.233186E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.697 | TFLOPs: 57.18 | 7: iteration 8090/ 44073 | consumed samples: 4142080 | consumed tokens: 8482979840 | elapsed time per iteration (s): 4.18 | learning rate: 1.867E-04 | global batch size: 512 | lm loss: 2.230306E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.521 | TFLOPs: 57.10 | 7: iteration 8100/ 44073 | consumed samples: 4147200 | consumed tokens: 8493465600 | elapsed time per iteration (s): 4.16 | learning rate: 1.867E-04 | global batch size: 512 | lm loss: 2.215515E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.188 | TFLOPs: 57.41 | 7: iteration 8110/ 44073 | consumed samples: 4152320 | consumed tokens: 8503951360 | elapsed time per iteration (s): 4.16 | learning rate: 1.866E-04 | global batch size: 512 | lm loss: 2.215153E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.036 | TFLOPs: 57.34 | 7: iteration 8120/ 44073 | consumed samples: 4157440 | consumed tokens: 
8514437120 | elapsed time per iteration (s): 4.17 | learning rate: 1.866E-04 | global batch size: 512 | lm loss: 2.217051E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.768 | TFLOPs: 57.22 | 7: iteration 8130/ 44073 | consumed samples: 4162560 | consumed tokens: 8524922880 | elapsed time per iteration (s): 4.16 | learning rate: 1.866E-04 | global batch size: 512 | lm loss: 2.226003E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.170 | TFLOPs: 57.40 | 7: iteration 8140/ 44073 | consumed samples: 4167680 | consumed tokens: 8535408640 | elapsed time per iteration (s): 4.18 | learning rate: 1.865E-04 | global batch size: 512 | lm loss: 2.213006E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.630 | TFLOPs: 57.15 | 7: iteration 8150/ 44073 | consumed samples: 4172800 | consumed tokens: 8545894400 | elapsed time per iteration (s): 4.16 | learning rate: 1.865E-04 | global batch size: 512 | lm loss: 2.223813E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.969 | TFLOPs: 57.31 | 7: iteration 8160/ 44073 | consumed samples: 4177920 | consumed tokens: 8556380160 | elapsed time per iteration (s): 4.15 | learning rate: 1.865E-04 | global batch size: 512 | lm loss: 2.240485E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.304 | TFLOPs: 57.47 | 7: iteration 8170/ 44073 | consumed samples: 4183040 | consumed tokens: 8566865920 | elapsed time per iteration (s): 4.15 | learning rate: 1.864E-04 | global batch size: 512 | lm loss: 2.222572E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.501 | TFLOPs: 
57.56 | 7: iteration 8180/ 44073 | consumed samples: 4188160 | consumed tokens: 8577351680 | elapsed time per iteration (s): 4.16 | learning rate: 1.864E-04 | global batch size: 512 | lm loss: 2.227790E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.092 | TFLOPs: 57.37 | 7: iteration 8190/ 44073 | consumed samples: 4193280 | consumed tokens: 8587837440 | elapsed time per iteration (s): 4.16 | learning rate: 1.864E-04 | global batch size: 512 | lm loss: 2.243182E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.982 | TFLOPs: 57.32 | 7: iteration 8200/ 44073 | consumed samples: 4198400 | consumed tokens: 8598323200 | elapsed time per iteration (s): 4.18 | learning rate: 1.863E-04 | global batch size: 512 | lm loss: 2.233438E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.371 | TFLOPs: 57.03 | 7: iteration 8210/ 44073 | consumed samples: 4203520 | consumed tokens: 8608808960 | elapsed time per iteration (s): 4.15 | learning rate: 1.863E-04 | global batch size: 512 | lm loss: 2.217681E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.274 | TFLOPs: 57.45 | 7: iteration 8220/ 44073 | consumed samples: 4208640 | consumed tokens: 8619294720 | elapsed time per iteration (s): 4.17 | learning rate: 1.862E-04 | global batch size: 512 | lm loss: 2.224763E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.711 | TFLOPs: 57.19 | 7: iteration 8230/ 44073 | consumed samples: 4213760 | consumed tokens: 8629780480 | elapsed time per iteration (s): 4.16 | learning rate: 1.862E-04 | global batch size: 512 | lm loss: 2.211929E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.996 | TFLOPs: 57.32 | 7: iteration 8240/ 44073 | consumed samples: 4218880 | consumed tokens: 8640266240 | elapsed time per iteration (s): 4.14 | learning rate: 1.862E-04 | global batch size: 512 | lm loss: 2.236510E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.552 | TFLOPs: 57.58 | 7: iteration 8250/ 44073 | consumed samples: 4224000 | consumed tokens: 8650752000 | elapsed time per iteration (s): 4.14 | learning rate: 1.861E-04 | global batch size: 512 | lm loss: 2.223680E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.753 | TFLOPs: 57.67 | 7: iteration 8260/ 44073 | consumed samples: 4229120 | consumed tokens: 8661237760 | elapsed time per iteration (s): 4.15 | learning rate: 1.861E-04 | global batch size: 512 | lm loss: 2.227982E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.232 | TFLOPs: 57.43 | 7: iteration 8270/ 44073 | consumed samples: 4234240 | consumed tokens: 8671723520 | elapsed time per iteration (s): 4.15 | learning rate: 1.861E-04 | global batch size: 512 | lm loss: 2.224430E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.467 | TFLOPs: 57.54 | 7: iteration 8280/ 44073 | consumed samples: 4239360 | consumed tokens: 8682209280 | elapsed time per iteration (s): 4.17 | learning rate: 1.860E-04 | global batch size: 512 | lm loss: 2.218825E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.888 | TFLOPs: 57.27 | 7: iteration 8290/ 44073 | consumed samples: 4244480 | consumed tokens: 8692695040 | elapsed time per iteration (s): 4.15 | learning rate: 1.860E-04 | global batch size: 512 | 
lm loss: 2.221619E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.303 | TFLOPs: 57.47 | 7: iteration 8300/ 44073 | consumed samples: 4249600 | consumed tokens: 8703180800 | elapsed time per iteration (s): 4.18 | learning rate: 1.860E-04 | global batch size: 512 | lm loss: 2.214886E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.547 | TFLOPs: 57.11 | 7: iteration 8310/ 44073 | consumed samples: 4254720 | consumed tokens: 8713666560 | elapsed time per iteration (s): 4.51 | learning rate: 1.859E-04 | global batch size: 512 | lm loss: 2.233981E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 113.507 | TFLOPs: 52.90 | 7: iteration 8320/ 44073 | consumed samples: 4259840 | consumed tokens: 8724152320 | elapsed time per iteration (s): 4.15 | learning rate: 1.859E-04 | global batch size: 512 | lm loss: 2.223455E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.258 | TFLOPs: 57.44 | 7: iteration 8330/ 44073 | consumed samples: 4264960 | consumed tokens: 8734638080 | elapsed time per iteration (s): 4.15 | learning rate: 1.859E-04 | global batch size: 512 | lm loss: 2.212235E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.247 | TFLOPs: 57.44 | 7: iteration 8340/ 44073 | consumed samples: 4270080 | consumed tokens: 8745123840 | elapsed time per iteration (s): 4.15 | learning rate: 1.858E-04 | global batch size: 512 | lm loss: 2.213644E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.378 | TFLOPs: 57.50 | 7: iteration 8350/ 44073 | consumed samples: 4275200 | consumed tokens: 8755609600 | elapsed time 
per iteration (s): 4.17 | learning rate: 1.858E-04 | global batch size: 512 | lm loss: 2.213015E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.907 | TFLOPs: 57.28 | 7: iteration 8360/ 44073 | consumed samples: 4280320 | consumed tokens: 8766095360 | elapsed time per iteration (s): 4.23 | learning rate: 1.858E-04 | global batch size: 512 | lm loss: 2.198634E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.044 | TFLOPs: 56.41 | 7: iteration 8370/ 44073 | consumed samples: 4285440 | consumed tokens: 8776581120 | elapsed time per iteration (s): 4.17 | learning rate: 1.857E-04 | global batch size: 512 | lm loss: 2.235921E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.715 | TFLOPs: 57.19 | 7: iteration 8380/ 44073 | consumed samples: 4290560 | consumed tokens: 8787066880 | elapsed time per iteration (s): 4.15 | learning rate: 1.857E-04 | global batch size: 512 | lm loss: 2.225637E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.416 | TFLOPs: 57.52 | 7: iteration 8390/ 44073 | consumed samples: 4295680 | consumed tokens: 8797552640 | elapsed time per iteration (s): 158.29 | learning rate: 1.857E-04 | global batch size: 512 | lm loss: 2.209099E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 3.235 | TFLOPs: 1.51 | 7: iteration 8400/ 44073 | consumed samples: 4300800 | consumed tokens: 8808038400 | elapsed time per iteration (s): 15.00 | learning rate: 1.856E-04 | global batch size: 512 | lm loss: 2.206378E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 34.124 | TFLOPs: 15.90 | 7: iteration 8410/ 
44073 | consumed samples: 4305920 | consumed tokens: 8818524160 | elapsed time per iteration (s): 8.46 | learning rate: 1.856E-04 | global batch size: 512 | lm loss: 2.193692E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 60.501 | TFLOPs: 28.20 | 7: iteration 8420/ 44073 | consumed samples: 4311040 | consumed tokens: 8829009920 | elapsed time per iteration (s): 4.16 | learning rate: 1.856E-04 | global batch size: 512 | lm loss: 2.203775E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.139 | TFLOPs: 57.39 | 7: iteration 8430/ 44073 | consumed samples: 4316160 | consumed tokens: 8839495680 | elapsed time per iteration (s): 4.16 | learning rate: 1.855E-04 | global batch size: 512 | lm loss: 2.217937E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.164 | TFLOPs: 57.40 | 7: iteration 8440/ 44073 | consumed samples: 4321280 | consumed tokens: 8849981440 | elapsed time per iteration (s): 4.16 | learning rate: 1.855E-04 | global batch size: 512 | lm loss: 2.211386E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.950 | TFLOPs: 57.30 | 7: iteration 8450/ 44073 | consumed samples: 4326400 | consumed tokens: 8860467200 | elapsed time per iteration (s): 4.19 | learning rate: 1.854E-04 | global batch size: 512 | lm loss: 2.214196E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.187 | TFLOPs: 56.95 | 7: iteration 8460/ 44073 | consumed samples: 4331520 | consumed tokens: 8870952960 | elapsed time per iteration (s): 4.19 | learning rate: 1.854E-04 | global batch size: 512 | lm loss: 2.193367E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.242 | TFLOPs: 56.97 | 7: iteration 8470/ 44073 | consumed samples: 4336640 | consumed tokens: 8881438720 | elapsed time per iteration (s): 4.19 | learning rate: 1.854E-04 | global batch size: 512 | lm loss: 2.204577E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.156 | TFLOPs: 56.93 | 7: iteration 8480/ 44073 | consumed samples: 4341760 | consumed tokens: 8891924480 | elapsed time per iteration (s): 5.86 | learning rate: 1.853E-04 | global batch size: 512 | lm loss: 2.200780E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 87.414 | TFLOPs: 40.74 | 7: iteration 8490/ 44073 | consumed samples: 4346880 | consumed tokens: 8902410240 | elapsed time per iteration (s): 4.14 | learning rate: 1.853E-04 | global batch size: 512 | lm loss: 2.223531E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.777 | TFLOPs: 57.69 | 7: iteration 8500/ 44073 | consumed samples: 4352000 | consumed tokens: 8912896000 | elapsed time per iteration (s): 4.14 | learning rate: 1.853E-04 | global batch size: 512 | lm loss: 2.226193E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.569 | TFLOPs: 57.59 | 7: iteration 8510/ 44073 | consumed samples: 4357120 | consumed tokens: 8923381760 | elapsed time per iteration (s): 4.16 | learning rate: 1.852E-04 | global batch size: 512 | lm loss: 2.225795E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.088 | TFLOPs: 57.37 | 7: iteration 8520/ 44073 | consumed samples: 4362240 | consumed tokens: 8933867520 | elapsed time per iteration (s): 4.15 | learning rate: 1.852E-04 | global batch size: 512 | lm loss: 2.227660E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.322 | TFLOPs: 57.47 | 7: iteration 8530/ 44073 | consumed samples: 4367360 | consumed tokens: 8944353280 | elapsed time per iteration (s): 4.15 | learning rate: 1.852E-04 | global batch size: 512 | lm loss: 2.198475E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.393 | TFLOPs: 57.51 | 7: iteration 8540/ 44073 | consumed samples: 4372480 | consumed tokens: 8954839040 | elapsed time per iteration (s): 4.19 | learning rate: 1.851E-04 | global batch size: 512 | lm loss: 2.216864E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.310 | TFLOPs: 57.00 | 7: iteration 8550/ 44073 | consumed samples: 4377600 | consumed tokens: 8965324800 | elapsed time per iteration (s): 4.17 | learning rate: 1.851E-04 | global batch size: 512 | lm loss: 2.212999E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.886 | TFLOPs: 57.27 | 7: iteration 8560/ 44073 | consumed samples: 4382720 | consumed tokens: 8975810560 | elapsed time per iteration (s): 4.16 | learning rate: 1.851E-04 | global batch size: 512 | lm loss: 2.223025E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.166 | TFLOPs: 57.40 | 7: iteration 8570/ 44073 | consumed samples: 4387840 | consumed tokens: 8986296320 | elapsed time per iteration (s): 4.16 | learning rate: 1.850E-04 | global batch size: 512 | lm loss: 2.200977E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.120 | TFLOPs: 57.38 | 7: iteration 8580/ 44073 | consumed samples: 4392960 | consumed tokens: 8996782080 | elapsed time per iteration (s): 4.19 | 
learning rate: 1.850E-04 | global batch size: 512 | lm loss: 2.207185E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.118 | TFLOPs: 56.91 | 7: iteration 8590/ 44073 | consumed samples: 4398080 | consumed tokens: 9007267840 | elapsed time per iteration (s): 4.18 | learning rate: 1.849E-04 | global batch size: 512 | lm loss: 2.214944E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.482 | TFLOPs: 57.08 | 7: iteration 8600/ 44073 | consumed samples: 4403200 | consumed tokens: 9017753600 | elapsed time per iteration (s): 4.18 | learning rate: 1.849E-04 | global batch size: 512 | lm loss: 2.200404E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.563 | TFLOPs: 57.12 | 7: iteration 8610/ 44073 | consumed samples: 4408320 | consumed tokens: 9028239360 | elapsed time per iteration (s): 4.14 | learning rate: 1.849E-04 | global batch size: 512 | lm loss: 2.189494E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.594 | TFLOPs: 57.60 | 7: iteration 8620/ 44073 | consumed samples: 4413440 | consumed tokens: 9038725120 | elapsed time per iteration (s): 4.18 | learning rate: 1.848E-04 | global batch size: 512 | lm loss: 2.231256E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.513 | TFLOPs: 57.10 | 7: iteration 8630/ 44073 | consumed samples: 4418560 | consumed tokens: 9049210880 | elapsed time per iteration (s): 4.14 | learning rate: 1.848E-04 | global batch size: 512 | lm loss: 2.205177E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.622 | TFLOPs: 57.61 | 7: iteration 8640/ 44073 | consumed samples: 
4423680 | consumed tokens: 9059696640 | elapsed time per iteration (s): 4.18 | learning rate: 1.848E-04 | global batch size: 512 | lm loss: 2.221767E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.622 | TFLOPs: 57.15 | 7: iteration 8650/ 44073 | consumed samples: 4428800 | consumed tokens: 9070182400 | elapsed time per iteration (s): 4.15 | learning rate: 1.847E-04 | global batch size: 512 | lm loss: 2.218979E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.369 | TFLOPs: 57.50 | 7: iteration 8660/ 44073 | consumed samples: 4433920 | consumed tokens: 9080668160 | elapsed time per iteration (s): 4.15 | learning rate: 1.847E-04 | global batch size: 512 | lm loss: 2.194951E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.382 | TFLOPs: 57.50 | 7: iteration 8670/ 44073 | consumed samples: 4439040 | consumed tokens: 9091153920 | elapsed time per iteration (s): 4.15 | learning rate: 1.847E-04 | global batch size: 512 | lm loss: 2.209589E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.243 | TFLOPs: 57.44 | 7: iteration 8680/ 44073 | consumed samples: 4444160 | consumed tokens: 9101639680 | elapsed time per iteration (s): 4.14 | learning rate: 1.846E-04 | global batch size: 512 | lm loss: 2.186078E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.584 | TFLOPs: 57.60 | 7: iteration 8690/ 44073 | consumed samples: 4449280 | consumed tokens: 9112125440 | elapsed time per iteration (s): 4.16 | learning rate: 1.846E-04 | global batch size: 512 | lm loss: 2.202589E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.208 | TFLOPs: 57.42 | 7: iteration 8700/ 44073 | consumed samples: 4454400 | consumed tokens: 9122611200 | elapsed time per iteration (s): 4.18 | learning rate: 1.845E-04 | global batch size: 512 | lm loss: 2.227609E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.464 | TFLOPs: 57.07 | 7: iteration 8710/ 44073 | consumed samples: 4459520 | consumed tokens: 9133096960 | elapsed time per iteration (s): 4.16 | learning rate: 1.845E-04 | global batch size: 512 | lm loss: 2.206580E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.215 | TFLOPs: 57.42 | 7: iteration 8720/ 44073 | consumed samples: 4464640 | consumed tokens: 9143582720 | elapsed time per iteration (s): 4.23 | learning rate: 1.845E-04 | global batch size: 512 | lm loss: 2.209246E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.912 | TFLOPs: 56.35 | 7: iteration 8730/ 44073 | consumed samples: 4469760 | consumed tokens: 9154068480 | elapsed time per iteration (s): 4.18 | learning rate: 1.844E-04 | global batch size: 512 | lm loss: 2.221895E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.344 | TFLOPs: 57.02 | 7: iteration 8740/ 44073 | consumed samples: 4474880 | consumed tokens: 9164554240 | elapsed time per iteration (s): 4.14 | learning rate: 1.844E-04 | global batch size: 512 | lm loss: 2.193647E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.660 | TFLOPs: 57.63 | 7: iteration 8750/ 44073 | consumed samples: 4480000 | consumed tokens: 9175040000 | elapsed time per iteration (s): 4.16 | learning rate: 1.844E-04 | global batch size: 512 | lm loss: 2.207760E+00 | grad norm: 0.133 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.044 | TFLOPs: 57.34 | 7: iteration 8760/ 44073 | consumed samples: 4485120 | consumed tokens: 9185525760 | elapsed time per iteration (s): 4.15 | learning rate: 1.843E-04 | global batch size: 512 | lm loss: 2.199103E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.463 | TFLOPs: 57.54 | 7: iteration 8770/ 44073 | consumed samples: 4490240 | consumed tokens: 9196011520 | elapsed time per iteration (s): 4.16 | learning rate: 1.843E-04 | global batch size: 512 | lm loss: 2.238655E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.031 | TFLOPs: 57.34 | 7: iteration 8780/ 44073 | consumed samples: 4495360 | consumed tokens: 9206497280 | elapsed time per iteration (s): 4.14 | learning rate: 1.843E-04 | global batch size: 512 | lm loss: 2.204179E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.604 | TFLOPs: 57.61 | 7: iteration 8790/ 44073 | consumed samples: 4500480 | consumed tokens: 9216983040 | elapsed time per iteration (s): 4.20 | learning rate: 1.842E-04 | global batch size: 512 | lm loss: 2.206167E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.763 | TFLOPs: 56.75 | 7: iteration 8800/ 44073 | consumed samples: 4505600 | consumed tokens: 9227468800 | elapsed time per iteration (s): 4.14 | learning rate: 1.842E-04 | global batch size: 512 | lm loss: 2.206960E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.616 | TFLOPs: 57.61 | 7: iteration 8810/ 44073 | consumed samples: 4510720 | consumed tokens: 9237954560 | elapsed time per iteration (s): 4.15 | learning rate: 1.841E-04 | global 
batch size: 512 | lm loss: 2.215411E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.457 | TFLOPs: 57.54 | 7: iteration 8820/ 44073 | consumed samples: 4515840 | consumed tokens: 9248440320 | elapsed time per iteration (s): 4.15 | learning rate: 1.841E-04 | global batch size: 512 | lm loss: 2.190487E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.332 | TFLOPs: 57.48 | 7: iteration 8830/ 44073 | consumed samples: 4520960 | consumed tokens: 9258926080 | elapsed time per iteration (s): 4.34 | learning rate: 1.841E-04 | global batch size: 512 | lm loss: 2.215929E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.078 | TFLOPs: 55.03 | 7: iteration 8840/ 44073 | consumed samples: 4526080 | consumed tokens: 9269411840 | elapsed time per iteration (s): 4.29 | learning rate: 1.840E-04 | global batch size: 512 | lm loss: 2.184925E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.374 | TFLOPs: 55.63 | 7: iteration 8850/ 44073 | consumed samples: 4531200 | consumed tokens: 9279897600 | elapsed time per iteration (s): 4.15 | learning rate: 1.840E-04 | global batch size: 512 | lm loss: 2.220944E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.457 | TFLOPs: 57.54 | 7: iteration 8860/ 44073 | consumed samples: 4536320 | consumed tokens: 9290383360 | elapsed time per iteration (s): 4.17 | learning rate: 1.840E-04 | global batch size: 512 | lm loss: 2.204100E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.712 | TFLOPs: 57.19 | 7: iteration 8870/ 44073 | consumed samples: 4541440 | consumed tokens: 
9300869120 | elapsed time per iteration (s): 4.15 | learning rate: 1.839E-04 | global batch size: 512 | lm loss: 2.183803E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.390 | TFLOPs: 57.51 | 7: iteration 8880/ 44073 | consumed samples: 4546560 | consumed tokens: 9311354880 | elapsed time per iteration (s): 4.14 | learning rate: 1.839E-04 | global batch size: 512 | lm loss: 2.226726E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.581 | TFLOPs: 57.60 | 7: iteration 8890/ 44073 | consumed samples: 4551680 | consumed tokens: 9321840640 | elapsed time per iteration (s): 4.23 | learning rate: 1.839E-04 | global batch size: 512 | lm loss: 2.222102E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.909 | TFLOPs: 56.35 | 7: iteration 8900/ 44073 | consumed samples: 4556800 | consumed tokens: 9332326400 | elapsed time per iteration (s): 4.18 | learning rate: 1.838E-04 | global batch size: 512 | lm loss: 2.648257E+00 | grad norm: 6.316 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.499 | TFLOPs: 57.09 | 7: iteration 8910/ 44073 | consumed samples: 4561920 | consumed tokens: 9342812160 | elapsed time per iteration (s): 4.17 | learning rate: 1.838E-04 | global batch size: 512 | lm loss: 2.699612E+00 | grad norm: 1.622 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.904 | TFLOPs: 57.28 | 7: iteration 8920/ 44073 | consumed samples: 4567040 | consumed tokens: 9353297920 | elapsed time per iteration (s): 4.17 | learning rate: 1.837E-04 | global batch size: 512 | lm loss: 2.501685E+00 | grad norm: 0.559 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.729 | TFLOPs: 
57.20 | 7: iteration 8930/ 44073 | consumed samples: 4572160 | consumed tokens: 9363783680 | elapsed time per iteration (s): 4.17 | learning rate: 1.837E-04 | global batch size: 512 | lm loss: 2.339660E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.650 | TFLOPs: 57.16 | 7: iteration 8940/ 44073 | consumed samples: 4577280 | consumed tokens: 9374269440 | elapsed time per iteration (s): 4.16 | learning rate: 1.837E-04 | global batch size: 512 | lm loss: 2.293772E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.084 | TFLOPs: 57.36 | 7: iteration 8950/ 44073 | consumed samples: 4582400 | consumed tokens: 9384755200 | elapsed time per iteration (s): 4.19 | learning rate: 1.836E-04 | global batch size: 512 | lm loss: 2.265898E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.166 | TFLOPs: 56.94 | 7: iteration 8960/ 44073 | consumed samples: 4587520 | consumed tokens: 9395240960 | elapsed time per iteration (s): 4.16 | learning rate: 1.836E-04 | global batch size: 512 | lm loss: 2.214307E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.088 | TFLOPs: 57.37 | 7: iteration 8970/ 44073 | consumed samples: 4592640 | consumed tokens: 9405726720 | elapsed time per iteration (s): 4.17 | learning rate: 1.836E-04 | global batch size: 512 | lm loss: 2.240850E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.770 | TFLOPs: 57.22 | 7: iteration 8980/ 44073 | consumed samples: 4597760 | consumed tokens: 9416212480 | elapsed time per iteration (s): 4.21 | learning rate: 1.835E-04 | global batch size: 512 | lm loss: 2.228578E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.629 | TFLOPs: 56.69 |
7: iteration 8990/ 44073 | consumed samples: 4602880 | consumed tokens: 9426698240 | elapsed time per iteration (s): 4.19 | learning rate: 1.835E-04 | global batch size: 512 | lm loss: 2.230531E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.225 | TFLOPs: 56.96 |
7: iteration 9000/ 44073 | consumed samples: 4608000 | consumed tokens: 9437184000 | elapsed time per iteration (s): 4.18 | learning rate: 1.834E-04 | global batch size: 512 | lm loss: 2.223321E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.477 | TFLOPs: 57.08 |
7: ------------------------------------------------------------------------------------------
7: valid loss at iteration 9000 | lm loss value: 2.148338E+00 | lm loss PPL: 8.570603E+00 |
7: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 9000 to checkpoints_2b2
0: [2022-11-25 20:56:21,452] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step9000 is begin to save!
0: [2022-11-25 20:56:21,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_01-model_00-model_states.pt...
0: [2022-11-25 20:56:21,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_01-model_00-model_states.pt.
0: [2022-11-25 20:56:21,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_03-model_00-model_states.pt...
0: [2022-11-25 20:56:21,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_03-model_00-model_states.pt.
0: [2022-11-25 20:56:21,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_04-model_00-model_states.pt...
0: [2022-11-25 20:56:22,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_04-model_00-model_states.pt.
0: [2022-11-25 20:56:22,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_05-model_00-model_states.pt...
0: [2022-11-25 20:56:22,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_05-model_00-model_states.pt.
0: [2022-11-25 20:56:22,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_06-model_00-model_states.pt...
0: [2022-11-25 20:56:22,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_06-model_00-model_states.pt.
0: [2022-11-25 20:56:22,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_07-model_00-model_states.pt...
0: [2022-11-25 20:56:22,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_07-model_00-model_states.pt.
0: [2022-11-25 20:56:22,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_08-model_00-model_states.pt...
0: [2022-11-25 20:56:22,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_08-model_00-model_states.pt.
0: [2022-11-25 20:56:22,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_09-model_00-model_states.pt...
0: [2022-11-25 20:56:22,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_09-model_00-model_states.pt.
0: [2022-11-25 20:56:22,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_10-model_00-model_states.pt...
0: [2022-11-25 20:56:22,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_10-model_00-model_states.pt.
0: [2022-11-25 20:56:22,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_11-model_00-model_states.pt...
0: [2022-11-25 20:56:23,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_11-model_00-model_states.pt.
0: [2022-11-25 20:56:23,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_12-model_00-model_states.pt...
0: [2022-11-25 20:56:23,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_12-model_00-model_states.pt.
0: [2022-11-25 20:56:23,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_13-model_00-model_states.pt...
0: [2022-11-25 20:56:23,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_13-model_00-model_states.pt.
0: [2022-11-25 20:56:23,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_14-model_00-model_states.pt...
0: [2022-11-25 20:56:23,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_14-model_00-model_states.pt.
0: [2022-11-25 20:56:23,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_15-model_00-model_states.pt...
0: [2022-11-25 20:56:23,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_15-model_00-model_states.pt.
0: [2022-11-25 20:56:23,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_16-model_00-model_states.pt...
0: [2022-11-25 20:56:23,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_16-model_00-model_states.pt.
0: [2022-11-25 20:56:23,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_17-model_00-model_states.pt...
0: [2022-11-25 20:56:23,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_17-model_00-model_states.pt.
0: [2022-11-25 20:56:23,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_18-model_00-model_states.pt...
0: [2022-11-25 20:56:24,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_18-model_00-model_states.pt.
0: [2022-11-25 20:56:24,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_19-model_00-model_states.pt...
0: [2022-11-25 20:56:24,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_19-model_00-model_states.pt.
0: [2022-11-25 20:56:24,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_20-model_00-model_states.pt...
0: [2022-11-25 20:56:24,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_20-model_00-model_states.pt.
0: [2022-11-25 20:56:24,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_21-model_00-model_states.pt...
0: [2022-11-25 20:56:24,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_21-model_00-model_states.pt.
0: [2022-11-25 20:56:24,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_22-model_00-model_states.pt...
0: [2022-11-25 20:56:24,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_22-model_00-model_states.pt.
0: [2022-11-25 20:56:24,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_23-model_00-model_states.pt...
0: [2022-11-25 20:56:24,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_23-model_00-model_states.pt.
0: [2022-11-25 20:56:24,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_24-model_00-model_states.pt...
0: [2022-11-25 20:56:24,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_24-model_00-model_states.pt.
0: [2022-11-25 20:56:24,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_25-model_00-model_states.pt...
0: [2022-11-25 20:56:24,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_25-model_00-model_states.pt.
0: [2022-11-25 20:56:24,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_26-model_00-model_states.pt...
0: [2022-11-25 20:56:25,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_26-model_00-model_states.pt.
0: [2022-11-25 20:56:25,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_27-model_00-model_states.pt...
0: [2022-11-25 20:56:25,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_27-model_00-model_states.pt.
0: [2022-11-25 20:56:25,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:56:25,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:56:25,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_29-model_00-model_states.pt... 0: [2022-11-25 20:56:25,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_29-model_00-model_states.pt. 0: [2022-11-25 20:56:25,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:56:25,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:56:25,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_31-model_00-model_states.pt... 0: [2022-11-25 20:56:25,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_31-model_00-model_states.pt. 0: [2022-11-25 20:56:25,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_32-model_00-model_states.pt... 0: [2022-11-25 20:56:25,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_32-model_00-model_states.pt. 0: [2022-11-25 20:56:25,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_33-model_00-model_states.pt... 0: [2022-11-25 20:56:26,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_33-model_00-model_states.pt. 
0: [2022-11-25 20:56:26,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_34-model_00-model_states.pt... 0: [2022-11-25 20:56:26,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_34-model_00-model_states.pt. 0: [2022-11-25 20:56:26,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/layer_36-model_00-model_states.pt... 0: [2022-11-25 20:56:26,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/layer_36-model_00-model_states.pt. 0: [2022-11-25 20:56:26,200] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step9000/mp_rank_00_model_states.pt 0: [2022-11-25 20:56:26,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/mp_rank_00_model_states.pt... 0: [2022-11-25 20:56:26,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
3: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:26,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:56:26,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:26,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:26,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:56:26,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:26,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 20:56:26,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:26,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
7: [2022-11-25 20:56:26,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:26,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:56:26,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:26,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:26,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 20:56:27,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:56:27,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:27,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 20:56:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
5: [2022-11-25 20:56:27,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
5: [2022-11-25 20:56:27,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 20:56:27,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:27,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:27,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 20:56:27,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:27,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 20:56:27,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:27,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:27,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 20:56:27,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 20:56:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
0: successfully saved checkpoint at iteration 9000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6045.55 7: iteration 9010/ 44073 | consumed samples: 4613120 | consumed tokens: 9447669760 | elapsed time per iteration (s): 4.88 | learning rate: 1.834E-04 | global batch size: 512 | lm loss: 2.235447E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.990 | TFLOPs: 48.93 | 7: iteration 9020/ 44073 | consumed samples: 4618240 | consumed tokens: 9458155520 | elapsed time per iteration (s): 4.21 | learning rate: 1.834E-04 | global batch size: 512 | lm loss: 2.210008E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.759 | TFLOPs: 56.75 | 7: iteration 9030/ 44073 | consumed samples: 4623360 | consumed tokens: 9468641280 | elapsed time per iteration (s): 4.23 | learning rate: 1.833E-04 | global batch size: 512 | lm loss: 2.205765E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.097 | TFLOPs: 56.44 | 7: iteration 9040/ 44073 | consumed samples: 4628480 | consumed tokens: 9479127040 | elapsed time per iteration (s): 4.18 | learning rate: 1.833E-04 | global batch size: 512 | lm loss: 2.213201E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.374 | TFLOPs: 57.03 | 7: iteration 9050/ 44073 | consumed samples: 4633600 | consumed tokens: 9489612800 | elapsed time per iteration (s): 4.19 | learning rate: 1.833E-04 | global batch size: 512 | lm loss: 2.217011E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.182 | TFLOPs: 56.94 | 7: iteration 9060/ 44073 | consumed samples: 4638720 | consumed tokens: 9500098560 | elapsed time per iteration (s): 4.25 | learning rate: 1.832E-04 | global 
batch size: 512 | lm loss: 2.239248E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.512 | TFLOPs: 56.16 | 7: iteration 9070/ 44073 | consumed samples: 4643840 | consumed tokens: 9510584320 | elapsed time per iteration (s): 4.21 | learning rate: 1.832E-04 | global batch size: 512 | lm loss: 2.209889E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.642 | TFLOPs: 56.69 | 7: iteration 9080/ 44073 | consumed samples: 4648960 | consumed tokens: 9521070080 | elapsed time per iteration (s): 4.14 | learning rate: 1.831E-04 | global batch size: 512 | lm loss: 2.176333E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.603 | TFLOPs: 57.61 | 7: iteration 9090/ 44073 | consumed samples: 4654080 | consumed tokens: 9531555840 | elapsed time per iteration (s): 4.17 | learning rate: 1.831E-04 | global batch size: 512 | lm loss: 2.194670E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.661 | TFLOPs: 57.17 | 7: iteration 9100/ 44073 | consumed samples: 4659200 | consumed tokens: 9542041600 | elapsed time per iteration (s): 4.17 | learning rate: 1.831E-04 | global batch size: 512 | lm loss: 2.208800E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.711 | TFLOPs: 57.19 | 7: iteration 9110/ 44073 | consumed samples: 4664320 | consumed tokens: 9552527360 | elapsed time per iteration (s): 4.18 | learning rate: 1.830E-04 | global batch size: 512 | lm loss: 2.202028E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.357 | TFLOPs: 57.02 | 7: iteration 9120/ 44073 | consumed samples: 4669440 | consumed tokens: 
9563013120 | elapsed time per iteration (s): 4.15 | learning rate: 1.830E-04 | global batch size: 512 | lm loss: 2.186416E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.238 | TFLOPs: 57.44 | 7: iteration 9130/ 44073 | consumed samples: 4674560 | consumed tokens: 9573498880 | elapsed time per iteration (s): 4.16 | learning rate: 1.830E-04 | global batch size: 512 | lm loss: 2.187679E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.937 | TFLOPs: 57.30 | 7: iteration 9140/ 44073 | consumed samples: 4679680 | consumed tokens: 9583984640 | elapsed time per iteration (s): 4.16 | learning rate: 1.829E-04 | global batch size: 512 | lm loss: 2.203920E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.053 | TFLOPs: 57.35 | 7: iteration 9150/ 44073 | consumed samples: 4684800 | consumed tokens: 9594470400 | elapsed time per iteration (s): 4.16 | learning rate: 1.829E-04 | global batch size: 512 | lm loss: 2.211957E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.208 | TFLOPs: 57.42 | 7: iteration 9160/ 44073 | consumed samples: 4689920 | consumed tokens: 9604956160 | elapsed time per iteration (s): 4.14 | learning rate: 1.828E-04 | global batch size: 512 | lm loss: 2.207672E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.615 | TFLOPs: 57.61 | 7: iteration 9170/ 44073 | consumed samples: 4695040 | consumed tokens: 9615441920 | elapsed time per iteration (s): 4.15 | learning rate: 1.828E-04 | global batch size: 512 | lm loss: 2.205569E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.408 | TFLOPs: 
57.51 | 7: iteration 9180/ 44073 | consumed samples: 4700160 | consumed tokens: 9625927680 | elapsed time per iteration (s): 4.15 | learning rate: 1.828E-04 | global batch size: 512 | lm loss: 2.193391E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.378 | TFLOPs: 57.50 | 7: iteration 9190/ 44073 | consumed samples: 4705280 | consumed tokens: 9636413440 | elapsed time per iteration (s): 4.16 | learning rate: 1.827E-04 | global batch size: 512 | lm loss: 2.220152E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.205 | TFLOPs: 57.42 | 7: iteration 9200/ 44073 | consumed samples: 4710400 | consumed tokens: 9646899200 | elapsed time per iteration (s): 4.16 | learning rate: 1.827E-04 | global batch size: 512 | lm loss: 2.208851E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.059 | TFLOPs: 57.35 | 7: iteration 9210/ 44073 | consumed samples: 4715520 | consumed tokens: 9657384960 | elapsed time per iteration (s): 4.17 | learning rate: 1.826E-04 | global batch size: 512 | lm loss: 2.185510E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.834 | TFLOPs: 57.25 | 7: iteration 9220/ 44073 | consumed samples: 4720640 | consumed tokens: 9667870720 | elapsed time per iteration (s): 4.14 | learning rate: 1.826E-04 | global batch size: 512 | lm loss: 2.219470E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.597 | TFLOPs: 57.60 | 7: iteration 9230/ 44073 | consumed samples: 4725760 | consumed tokens: 9678356480 | elapsed time per iteration (s): 4.15 | learning rate: 1.826E-04 | global batch size: 512 | lm loss: 2.178477E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.473 | TFLOPs: 57.54 | 7: iteration 9240/ 44073 | consumed samples: 4730880 | consumed tokens: 9688842240 | elapsed time per iteration (s): 4.16 | learning rate: 1.825E-04 | global batch size: 512 | lm loss: 2.185195E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.013 | TFLOPs: 57.33 | 7: iteration 9250/ 44073 | consumed samples: 4736000 | consumed tokens: 9699328000 | elapsed time per iteration (s): 4.15 | learning rate: 1.825E-04 | global batch size: 512 | lm loss: 2.200260E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.343 | TFLOPs: 57.48 | 7: iteration 9260/ 44073 | consumed samples: 4741120 | consumed tokens: 9709813760 | elapsed time per iteration (s): 4.15 | learning rate: 1.825E-04 | global batch size: 512 | lm loss: 2.180982E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.310 | TFLOPs: 57.47 | 7: iteration 9270/ 44073 | consumed samples: 4746240 | consumed tokens: 9720299520 | elapsed time per iteration (s): 4.18 | learning rate: 1.824E-04 | global batch size: 512 | lm loss: 2.204428E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.480 | TFLOPs: 57.08 | 7: iteration 9280/ 44073 | consumed samples: 4751360 | consumed tokens: 9730785280 | elapsed time per iteration (s): 4.17 | learning rate: 1.824E-04 | global batch size: 512 | lm loss: 2.210906E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.659 | TFLOPs: 57.17 | 7: iteration 9290/ 44073 | consumed samples: 4756480 | consumed tokens: 9741271040 | elapsed time per iteration (s): 4.15 | learning rate: 1.823E-04 | global batch size: 512 | 
lm loss: 2.190407E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.308 | TFLOPs: 57.47 | 7: iteration 9300/ 44073 | consumed samples: 4761600 | consumed tokens: 9751756800 | elapsed time per iteration (s): 4.19 | learning rate: 1.823E-04 | global batch size: 512 | lm loss: 2.187489E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.161 | TFLOPs: 56.93 | 7: iteration 9310/ 44073 | consumed samples: 4766720 | consumed tokens: 9762242560 | elapsed time per iteration (s): 4.15 | learning rate: 1.823E-04 | global batch size: 512 | lm loss: 2.200314E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.387 | TFLOPs: 57.50 | 7: iteration 9320/ 44073 | consumed samples: 4771840 | consumed tokens: 9772728320 | elapsed time per iteration (s): 4.18 | learning rate: 1.822E-04 | global batch size: 512 | lm loss: 2.201683E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.630 | TFLOPs: 57.15 | 7: iteration 9330/ 44073 | consumed samples: 4776960 | consumed tokens: 9783214080 | elapsed time per iteration (s): 4.18 | learning rate: 1.822E-04 | global batch size: 512 | lm loss: 2.206189E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.547 | TFLOPs: 57.11 | 7: iteration 9340/ 44073 | consumed samples: 4782080 | consumed tokens: 9793699840 | elapsed time per iteration (s): 4.15 | learning rate: 1.821E-04 | global batch size: 512 | lm loss: 2.206407E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.453 | TFLOPs: 57.54 | 7: iteration 9350/ 44073 | consumed samples: 4787200 | consumed tokens: 9804185600 | elapsed time 
per iteration (s): 4.21 | learning rate: 1.821E-04 | global batch size: 512 | lm loss: 2.195797E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.562 | TFLOPs: 56.65 | 7: iteration 9360/ 44073 | consumed samples: 4792320 | consumed tokens: 9814671360 | elapsed time per iteration (s): 4.15 | learning rate: 1.821E-04 | global batch size: 512 | lm loss: 2.206889E+00 | grad norm: 0.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.474 | TFLOPs: 57.55 | 7: iteration 9370/ 44073 | consumed samples: 4797440 | consumed tokens: 9825157120 | elapsed time per iteration (s): 4.22 | learning rate: 1.820E-04 | global batch size: 512 | lm loss: 2.203504E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.327 | TFLOPs: 56.54 | 7: iteration 9380/ 44073 | consumed samples: 4802560 | consumed tokens: 9835642880 | elapsed time per iteration (s): 4.15 | learning rate: 1.820E-04 | global batch size: 512 | lm loss: 2.203268E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.363 | TFLOPs: 57.49 | 7: iteration 9390/ 44073 | consumed samples: 4807680 | consumed tokens: 9846128640 | elapsed time per iteration (s): 4.18 | learning rate: 1.820E-04 | global batch size: 512 | lm loss: 2.187222E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.404 | TFLOPs: 57.05 | 7: iteration 9400/ 44073 | consumed samples: 4812800 | consumed tokens: 9856614400 | elapsed time per iteration (s): 4.17 | learning rate: 1.819E-04 | global batch size: 512 | lm loss: 2.197231E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.779 | TFLOPs: 57.22 | 7: iteration 9410/ 
44073 | consumed samples: 4817920 | consumed tokens: 9867100160 | elapsed time per iteration (s): 4.16 | learning rate: 1.819E-04 | global batch size: 512 | lm loss: 2.184396E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.188 | TFLOPs: 57.41 | 7: iteration 9420/ 44073 | consumed samples: 4823040 | consumed tokens: 9877585920 | elapsed time per iteration (s): 4.16 | learning rate: 1.818E-04 | global batch size: 512 | lm loss: 2.179099E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.189 | TFLOPs: 57.41 | 7: iteration 9430/ 44073 | consumed samples: 4828160 | consumed tokens: 9888071680 | elapsed time per iteration (s): 4.15 | learning rate: 1.818E-04 | global batch size: 512 | lm loss: 2.188447E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.374 | TFLOPs: 57.50 | 7: iteration 9440/ 44073 | consumed samples: 4833280 | consumed tokens: 9898557440 | elapsed time per iteration (s): 4.14 | learning rate: 1.818E-04 | global batch size: 512 | lm loss: 2.193665E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.609 | TFLOPs: 57.61 | 7: iteration 9450/ 44073 | consumed samples: 4838400 | consumed tokens: 9909043200 | elapsed time per iteration (s): 4.17 | learning rate: 1.817E-04 | global batch size: 512 | lm loss: 2.196838E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.787 | TFLOPs: 57.23 | 7: iteration 9460/ 44073 | consumed samples: 4843520 | consumed tokens: 9919528960 | elapsed time per iteration (s): 4.16 | learning rate: 1.817E-04 | global batch size: 512 | lm loss: 2.183683E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.211 | TFLOPs: 57.42 | 7: iteration 9470/ 44073 | consumed samples: 4848640 | consumed tokens: 9930014720 | elapsed time per iteration (s): 4.15 | learning rate: 1.816E-04 | global batch size: 512 | lm loss: 2.207518E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.517 | TFLOPs: 57.57 | 7: iteration 9480/ 44073 | consumed samples: 4853760 | consumed tokens: 9940500480 | elapsed time per iteration (s): 4.14 | learning rate: 1.816E-04 | global batch size: 512 | lm loss: 2.173876E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.601 | TFLOPs: 57.60 | 7: iteration 9490/ 44073 | consumed samples: 4858880 | consumed tokens: 9950986240 | elapsed time per iteration (s): 4.14 | learning rate: 1.816E-04 | global batch size: 512 | lm loss: 2.209200E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.730 | TFLOPs: 57.66 | 7: iteration 9500/ 44073 | consumed samples: 4864000 | consumed tokens: 9961472000 | elapsed time per iteration (s): 4.16 | learning rate: 1.815E-04 | global batch size: 512 | lm loss: 2.205359E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.168 | TFLOPs: 57.40 | 7: iteration 9510/ 44073 | consumed samples: 4869120 | consumed tokens: 9971957760 | elapsed time per iteration (s): 4.15 | learning rate: 1.815E-04 | global batch size: 512 | lm loss: 2.210431E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.475 | TFLOPs: 57.55 | 7: iteration 9520/ 44073 | consumed samples: 4874240 | consumed tokens: 9982443520 | elapsed time per iteration (s): 4.17 | learning rate: 1.814E-04 | global batch size: 512 | lm loss: 2.192919E+00 | grad 
norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.921 | TFLOPs: 57.29 |
7: iteration 9530/ 44073 | consumed samples: 4879360 | consumed tokens: 9992929280 | elapsed time per iteration (s): 4.21 | learning rate: 1.814E-04 | global batch size: 512 | lm loss: 2.179896E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.708 | TFLOPs: 56.72 |
7: iteration 9540/ 44073 | consumed samples: 4884480 | consumed tokens: 10003415040 | elapsed time per iteration (s): 4.16 | learning rate: 1.814E-04 | global batch size: 512 | lm loss: 2.182400E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.224 | TFLOPs: 57.43 |
7: iteration 9550/ 44073 | consumed samples: 4889600 | consumed tokens: 10013900800 | elapsed time per iteration (s): 4.16 | learning rate: 1.813E-04 | global batch size: 512 | lm loss: 2.216429E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.059 | TFLOPs: 57.35 |
7: iteration 9560/ 44073 | consumed samples: 4894720 | consumed tokens: 10024386560 | elapsed time per iteration (s): 4.14 | learning rate: 1.813E-04 | global batch size: 512 | lm loss: 2.202911E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.805 | TFLOPs: 57.70 |
7: iteration 9570/ 44073 | consumed samples: 4899840 | consumed tokens: 10034872320 | elapsed time per iteration (s): 4.14 | learning rate: 1.812E-04 | global batch size: 512 | lm loss: 2.178866E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.790 | TFLOPs: 57.69 |
7: iteration 9580/ 44073 | consumed samples: 4904960 | consumed tokens: 10045358080 | elapsed time per iteration (s): 4.15 | learning rate: 1.812E-04 | global batch size: 512 | lm loss: 2.188958E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.247 | TFLOPs: 57.44 |
7: iteration 9590/ 44073 | consumed samples: 4910080 | consumed tokens: 10055843840 | elapsed time per iteration (s): 4.16 | learning rate: 1.812E-04 | global batch size: 512 | lm loss: 2.186598E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.014 | TFLOPs: 57.33 |
7: iteration 9600/ 44073 | consumed samples: 4915200 | consumed tokens: 10066329600 | elapsed time per iteration (s): 4.14 | learning rate: 1.811E-04 | global batch size: 512 | lm loss: 2.195447E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.542 | TFLOPs: 57.58 |
7: iteration 9610/ 44073 | consumed samples: 4920320 | consumed tokens: 10076815360 | elapsed time per iteration (s): 4.18 | learning rate: 1.811E-04 | global batch size: 512 | lm loss: 2.188310E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.560 | TFLOPs: 57.12 |
7: iteration 9620/ 44073 | consumed samples: 4925440 | consumed tokens: 10087301120 | elapsed time per iteration (s): 4.16 | learning rate: 1.810E-04 | global batch size: 512 | lm loss: 2.193096E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.999 | TFLOPs: 57.32 |
7: iteration 9630/ 44073 | consumed samples: 4930560 | consumed tokens: 10097786880 | elapsed time per iteration (s): 4.16 | learning rate: 1.810E-04 | global batch size: 512 | lm loss: 2.199630E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.224 | TFLOPs: 57.43 |
7: iteration 9640/ 44073 | consumed samples: 4935680 | consumed tokens: 10108272640 | elapsed time per iteration (s): 4.14 | learning rate: 1.810E-04 | global batch size: 512 | lm loss: 2.171641E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.798 | TFLOPs: 57.70 |
7: iteration 9650/ 44073 | consumed samples: 4940800 | consumed tokens: 10118758400 | elapsed time per iteration (s): 4.22 | learning rate: 1.809E-04 | global batch size: 512 | lm loss: 2.190192E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.423 | TFLOPs: 56.59 |
7: iteration 9660/ 44073 | consumed samples: 4945920 | consumed tokens: 10129244160 | elapsed time per iteration (s): 4.13 | learning rate: 1.809E-04 | global batch size: 512 | lm loss: 2.172929E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.839 | TFLOPs: 57.72 |
7: iteration 9670/ 44073 | consumed samples: 4951040 | consumed tokens: 10139729920 | elapsed time per iteration (s): 4.14 | learning rate: 1.808E-04 | global batch size: 512 | lm loss: 2.177066E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.725 | TFLOPs: 57.66 |
7: iteration 9680/ 44073 | consumed samples: 4956160 | consumed tokens: 10150215680 | elapsed time per iteration (s): 4.13 | learning rate: 1.808E-04 | global batch size: 512 | lm loss: 2.178164E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.840 | TFLOPs: 57.72 |
7: iteration 9690/ 44073 | consumed samples: 4961280 | consumed tokens: 10160701440 | elapsed time per iteration (s): 4.14 | learning rate: 1.808E-04 | global batch size: 512 | lm loss: 2.170226E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.706 | TFLOPs: 57.65 |
7: iteration 9700/ 44073 | consumed samples: 4966400 | consumed tokens: 10171187200 | elapsed time per iteration (s): 4.14 | learning rate: 1.807E-04 | global batch size: 512 | lm loss: 2.187264E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.808 | TFLOPs: 57.70 |
7: iteration 9710/ 44073 | consumed samples: 4971520 | consumed tokens: 10181672960 | elapsed time per iteration (s): 4.14 | learning rate: 1.807E-04 | global batch size: 512 | lm loss: 2.191899E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.657 | TFLOPs: 57.63 |
7: iteration 9720/ 44073 | consumed samples: 4976640 | consumed tokens: 10192158720 | elapsed time per iteration (s): 4.14 | learning rate: 1.806E-04 | global batch size: 512 | lm loss: 2.196539E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.778 | TFLOPs: 57.69 |
7: iteration 9730/ 44073 | consumed samples: 4981760 | consumed tokens: 10202644480 | elapsed time per iteration (s): 4.14 | learning rate: 1.806E-04 | global batch size: 512 | lm loss: 2.189412E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.750 | TFLOPs: 57.67 |
7: iteration 9740/ 44073 | consumed samples: 4986880 | consumed tokens: 10213130240 | elapsed time per iteration (s): 4.14 | learning rate: 1.806E-04 | global batch size: 512 | lm loss: 2.165384E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.653 | TFLOPs: 57.63 |
7: iteration 9750/ 44073 | consumed samples: 4992000 | consumed tokens: 10223616000 | elapsed time per iteration (s): 4.14 | learning rate: 1.805E-04 | global batch size: 512 | lm loss: 2.179192E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.655 | TFLOPs: 57.63 |
7: iteration 9760/ 44073 | consumed samples: 4997120 | consumed tokens: 10234101760 | elapsed time per iteration (s): 4.18 | learning rate: 1.805E-04 | global batch size: 512 | lm loss: 2.196354E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.606 | TFLOPs: 57.14 |
7: iteration 9770/ 44073 | consumed samples: 5002240 | consumed tokens: 10244587520 | elapsed time per iteration (s): 4.31 | learning rate: 1.804E-04 | global batch size: 512 | lm loss: 2.177301E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.702 | TFLOPs: 55.32 |
7: iteration 9780/ 44073 | consumed samples: 5007360 | consumed tokens: 10255073280 | elapsed time per iteration (s): 4.15 | learning rate: 1.804E-04 | global batch size: 512 | lm loss: 2.193588E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.487 | TFLOPs: 57.55 |
7: iteration 9790/ 44073 | consumed samples: 5012480 | consumed tokens: 10265559040 | elapsed time per iteration (s): 4.18 | learning rate: 1.804E-04 | global batch size: 512 | lm loss: 2.177109E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.566 | TFLOPs: 57.12 |
7: iteration 9800/ 44073 | consumed samples: 5017600 | consumed tokens: 10276044800 | elapsed time per iteration (s): 4.14 | learning rate: 1.803E-04 | global batch size: 512 | lm loss: 2.159750E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.557 | TFLOPs: 57.58 |
7: iteration 9810/ 44073 | consumed samples: 5022720 | consumed tokens: 10286530560 | elapsed time per iteration (s): 4.18 | learning rate: 1.803E-04 | global batch size: 512 | lm loss: 2.189910E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.465 | TFLOPs: 57.08 |
7: iteration 9820/ 44073 | consumed samples: 5027840 | consumed tokens: 10297016320 | elapsed time per iteration (s): 4.19 | learning rate: 1.802E-04 | global batch size: 512 | lm loss: 2.197772E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.293 | TFLOPs: 56.99 |
7: iteration 9830/ 44073 | consumed samples: 5032960 | consumed tokens: 10307502080 | elapsed time per iteration (s): 4.31 | learning rate: 1.802E-04 | global batch size: 512 | lm loss: 2.169127E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.896 | TFLOPs: 55.41 |
7: iteration 9840/ 44073 | consumed samples: 5038080 | consumed tokens: 10317987840 | elapsed time per iteration (s): 4.14 | learning rate: 1.802E-04 | global batch size: 512 | lm loss: 2.165165E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.641 | TFLOPs: 57.62 |
7: iteration 9850/ 44073 | consumed samples: 5043200 | consumed tokens: 10328473600 | elapsed time per iteration (s): 4.15 | learning rate: 1.801E-04 | global batch size: 512 | lm loss: 2.171303E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.414 | TFLOPs: 57.52 |
7: iteration 9860/ 44073 | consumed samples: 5048320 | consumed tokens: 10338959360 | elapsed time per iteration (s): 4.14 | learning rate: 1.801E-04 | global batch size: 512 | lm loss: 2.162444E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.552 | TFLOPs: 57.58 |
7: iteration 9870/ 44073 | consumed samples: 5053440 | consumed tokens: 10349445120 | elapsed time per iteration (s): 4.15 | learning rate: 1.800E-04 | global batch size: 512 | lm loss: 2.188893E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.279 | TFLOPs: 57.45 |
7: iteration 9880/ 44073 | consumed samples: 5058560 | consumed tokens: 10359930880 | elapsed time per iteration (s): 4.14 | learning rate: 1.800E-04 | global batch size: 512 | lm loss: 2.176783E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.551 | TFLOPs: 57.58 |
7: iteration 9890/ 44073 | consumed samples: 5063680 | consumed tokens: 10370416640 | elapsed time per iteration (s): 4.17 | learning rate: 1.800E-04 | global batch size: 512 | lm loss: 2.184530E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.801 | TFLOPs: 57.23 |
7: iteration 9900/ 44073 | consumed samples: 5068800 | consumed tokens: 10380902400 | elapsed time per iteration (s): 4.14 | learning rate: 1.799E-04 | global batch size: 512 | lm loss: 2.166039E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.566 | TFLOPs: 57.59 |
7: iteration 9910/ 44073 | consumed samples: 5073920 | consumed tokens: 10391388160 | elapsed time per iteration (s): 4.17 | learning rate: 1.799E-04 | global batch size: 512 | lm loss: 2.182326E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.722 | TFLOPs: 57.19 |
7: iteration 9920/ 44073 | consumed samples: 5079040 | consumed tokens: 10401873920 | elapsed time per iteration (s): 4.15 | learning rate: 1.798E-04 | global batch size: 512 | lm loss: 2.156380E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.297 | TFLOPs: 57.46 |
7: iteration 9930/ 44073 | consumed samples: 5084160 | consumed tokens: 10412359680 | elapsed time per iteration (s): 4.16 | learning rate: 1.798E-04 | global batch size: 512 | lm loss: 2.193578E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.147 | TFLOPs: 57.39 |
7: iteration 9940/ 44073 | consumed samples: 5089280 | consumed tokens: 10422845440 | elapsed time per iteration (s): 4.21 | learning rate: 1.798E-04 | global batch size: 512 | lm loss: 2.187983E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.496 | TFLOPs: 56.62 |
7: iteration 9950/ 44073 | consumed samples: 5094400 | consumed tokens: 10433331200 | elapsed time per iteration (s): 4.89 | learning rate: 1.797E-04 | global batch size: 512 | lm loss: 2.178090E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.694 | TFLOPs: 48.79 |
7: iteration 9960/ 44073 | consumed samples: 5099520 | consumed tokens: 10443816960 | elapsed time per iteration (s): 4.16 | learning rate: 1.797E-04 | global batch size: 512 | lm loss: 2.199185E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.086 | TFLOPs: 57.36 |
7: iteration 9970/ 44073 | consumed samples: 5104640 | consumed tokens: 10454302720 | elapsed time per iteration (s): 4.14 | learning rate: 1.796E-04 | global batch size: 512 | lm loss: 2.172753E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.567 | TFLOPs: 57.59 |
7: iteration 9980/ 44073 | consumed samples: 5109760 | consumed tokens: 10464788480 | elapsed time per iteration (s): 4.14 | learning rate: 1.796E-04 | global batch size: 512 | lm loss: 2.153878E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.679 | TFLOPs: 57.64 |
7: iteration 9990/ 44073 | consumed samples: 5114880 | consumed tokens: 10475274240 | elapsed time per iteration (s): 4.15 | learning rate: 1.796E-04 | global batch size: 512 | lm loss: 2.192555E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.481 | TFLOPs: 57.55 |
0: [2022-11-25 22:05:59,615] [INFO] [logging.py:68:log_dist] [Rank 0] step=10000, skipped=0, lr=[0.0001795110044087738, 0.0001795110044087738, 0.0001795110044087738], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
7: iteration 10000/ 44073 | consumed samples: 5120000 | consumed tokens: 10485760000 | elapsed time per iteration (s): 4.24 | learning rate: 1.795E-04 | global batch size: 512 | lm loss: 2.198445E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.647 | TFLOPs: 56.23 |
0: steps: 10000 loss: 2.1802 iter time (s): 5.023 samples/sec: 101.935
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 10000 | lm loss value: 2.113586E+00 | lm loss PPL: 8.277872E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 10000 to checkpoints_2b2
0: [2022-11-25 22:06:01,010] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step10000 is begin to save!
0: [2022-11-25 22:06:01,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_01-model_00-model_states.pt...
0: [2022-11-25 22:06:01,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_01-model_00-model_states.pt.
0: [2022-11-25 22:06:01,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_03-model_00-model_states.pt...
0: [2022-11-25 22:06:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_03-model_00-model_states.pt.
0: [2022-11-25 22:06:01,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_04-model_00-model_states.pt...
0: [2022-11-25 22:06:01,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_04-model_00-model_states.pt.
0: [2022-11-25 22:06:01,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_05-model_00-model_states.pt...
0: [2022-11-25 22:06:01,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_05-model_00-model_states.pt.
0: [2022-11-25 22:06:01,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_06-model_00-model_states.pt...
0: [2022-11-25 22:06:01,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_06-model_00-model_states.pt.
0: [2022-11-25 22:06:01,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_07-model_00-model_states.pt...
0: [2022-11-25 22:06:02,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_07-model_00-model_states.pt.
0: [2022-11-25 22:06:02,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_08-model_00-model_states.pt...
0: [2022-11-25 22:06:02,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_08-model_00-model_states.pt.
0: [2022-11-25 22:06:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_09-model_00-model_states.pt...
0: [2022-11-25 22:06:02,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_09-model_00-model_states.pt.
0: [2022-11-25 22:06:02,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_10-model_00-model_states.pt...
0: [2022-11-25 22:06:02,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_10-model_00-model_states.pt.
0: [2022-11-25 22:06:02,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_11-model_00-model_states.pt...
0: [2022-11-25 22:06:02,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_11-model_00-model_states.pt.
0: [2022-11-25 22:06:02,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_12-model_00-model_states.pt...
0: [2022-11-25 22:06:02,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_12-model_00-model_states.pt.
0: [2022-11-25 22:06:02,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_13-model_00-model_states.pt...
0: [2022-11-25 22:06:02,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_13-model_00-model_states.pt.
0: [2022-11-25 22:06:02,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_14-model_00-model_states.pt...
0: [2022-11-25 22:06:03,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_14-model_00-model_states.pt.
0: [2022-11-25 22:06:03,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_15-model_00-model_states.pt...
0: [2022-11-25 22:06:03,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_15-model_00-model_states.pt.
0: [2022-11-25 22:06:03,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_16-model_00-model_states.pt...
0: [2022-11-25 22:06:03,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_16-model_00-model_states.pt.
0: [2022-11-25 22:06:03,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_17-model_00-model_states.pt...
0: [2022-11-25 22:06:03,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_17-model_00-model_states.pt.
0: [2022-11-25 22:06:03,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_18-model_00-model_states.pt...
0: [2022-11-25 22:06:03,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_18-model_00-model_states.pt.
0: [2022-11-25 22:06:03,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_19-model_00-model_states.pt...
0: [2022-11-25 22:06:03,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_19-model_00-model_states.pt.
0: [2022-11-25 22:06:03,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_20-model_00-model_states.pt...
0: [2022-11-25 22:06:03,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_20-model_00-model_states.pt.
0: [2022-11-25 22:06:03,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_21-model_00-model_states.pt...
0: [2022-11-25 22:06:03,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_21-model_00-model_states.pt.
0: [2022-11-25 22:06:03,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_22-model_00-model_states.pt...
0: [2022-11-25 22:06:04,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_22-model_00-model_states.pt.
0: [2022-11-25 22:06:04,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_23-model_00-model_states.pt...
0: [2022-11-25 22:06:04,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_23-model_00-model_states.pt.
0: [2022-11-25 22:06:04,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_24-model_00-model_states.pt...
0: [2022-11-25 22:06:04,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_24-model_00-model_states.pt.
0: [2022-11-25 22:06:04,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_25-model_00-model_states.pt...
0: [2022-11-25 22:06:04,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_25-model_00-model_states.pt.
0: [2022-11-25 22:06:04,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_26-model_00-model_states.pt...
0: [2022-11-25 22:06:04,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_26-model_00-model_states.pt.
0: [2022-11-25 22:06:04,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_27-model_00-model_states.pt...
0: [2022-11-25 22:06:04,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_27-model_00-model_states.pt.
0: [2022-11-25 22:06:04,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_28-model_00-model_states.pt...
0: [2022-11-25 22:06:04,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_28-model_00-model_states.pt.
0: [2022-11-25 22:06:04,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_29-model_00-model_states.pt...
0: [2022-11-25 22:06:05,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_29-model_00-model_states.pt.
0: [2022-11-25 22:06:05,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_30-model_00-model_states.pt...
0: [2022-11-25 22:06:05,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_30-model_00-model_states.pt.
0: [2022-11-25 22:06:05,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_31-model_00-model_states.pt...
0: [2022-11-25 22:06:05,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_31-model_00-model_states.pt.
0: [2022-11-25 22:06:05,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_32-model_00-model_states.pt...
0: [2022-11-25 22:06:05,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_32-model_00-model_states.pt.
0: [2022-11-25 22:06:05,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_33-model_00-model_states.pt...
0: [2022-11-25 22:06:05,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_33-model_00-model_states.pt.
0: [2022-11-25 22:06:05,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_34-model_00-model_states.pt...
0: [2022-11-25 22:06:05,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_34-model_00-model_states.pt.
0: [2022-11-25 22:06:05,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/layer_36-model_00-model_states.pt...
0: [2022-11-25 22:06:05,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/layer_36-model_00-model_states.pt.
0: [2022-11-25 22:06:05,666] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step10000/mp_rank_00_model_states.pt
0: [2022-11-25 22:06:05,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/mp_rank_00_model_states.pt...
0: [2022-11-25 22:06:05,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/mp_rank_00_model_states.pt.
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt...
6: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt...
7: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt...
3: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt...
4: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt...
2: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt...
5: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt...
1: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt...
0: [2022-11-25 22:06:06,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt.
0: [2022-11-25 22:06:06,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
0: [2022-11-25 22:06:06,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now!
0: [2022-11-25 22:06:06,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt.
0: [2022-11-25 22:06:06,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
0: [2022-11-25 22:06:06,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now!
0: [2022-11-25 22:06:06,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:06:06,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
1: [2022-11-25 22:06:06,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:06:06,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:06:06,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:06:06,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:06:06,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:06:06,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:06:06,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:06:06,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:06:06,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:06:06,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:06:06,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:06:06,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
3: [2022-11-25 22:06:06,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:06:06,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:06:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:06:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:06:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
5: [2022-11-25 22:06:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:06:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:06:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:06:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:06:06,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:06:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:06:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:06:06,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:06:06,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:06:06,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:06:06,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:06,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:06,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:06,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:06,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:06,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:06:07,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:07,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:07,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:07,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:06:07,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
0: successfully saved checkpoint at iteration 10000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5999.30 7: iteration 10010/ 44073 | consumed samples: 5125120 | consumed tokens: 10496245760 | elapsed time per iteration (s): 4.90 | learning rate: 1.795E-04 | global batch size: 512 | lm loss: 2.169788E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.511 | TFLOPs: 48.71 | 7: iteration 10020/ 44073 | consumed samples: 5130240 | consumed tokens: 10506731520 | elapsed time per iteration (s): 4.16 | learning rate: 1.794E-04 | global batch size: 512 | lm loss: 2.185065E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.033 | TFLOPs: 57.34 | 7: iteration 10030/ 44073 | consumed samples: 5135360 | consumed tokens: 10517217280 | elapsed time per iteration (s): 4.14 | learning rate: 1.794E-04 | global batch size: 512 | lm loss: 2.173539E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.616 | TFLOPs: 57.61 | 7: iteration 10040/ 44073 | consumed samples: 5140480 | consumed tokens: 10527703040 | elapsed time per iteration (s): 4.16 | learning rate: 1.793E-04 | global batch size: 512 | lm loss: 2.156450E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.100 | TFLOPs: 57.37 | 7: iteration 10050/ 44073 | consumed samples: 5145600 | consumed tokens: 10538188800 | elapsed time per iteration (s): 4.16 | learning rate: 1.793E-04 | global batch size: 512 | lm loss: 2.165633E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.213 | TFLOPs: 57.42 | 7: iteration 10060/ 44073 | consumed samples: 5150720 | consumed tokens: 10548674560 | elapsed time per iteration (s): 4.15 | learning rate: 
1.793E-04 | global batch size: 512 | lm loss: 2.157756E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.318 | TFLOPs: 57.47 | 7: iteration 10070/ 44073 | consumed samples: 5155840 | consumed tokens: 10559160320 | elapsed time per iteration (s): 4.15 | learning rate: 1.792E-04 | global batch size: 512 | lm loss: 2.171285E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.490 | TFLOPs: 57.55 | 7: iteration 10080/ 44073 | consumed samples: 5160960 | consumed tokens: 10569646080 | elapsed time per iteration (s): 4.14 | learning rate: 1.792E-04 | global batch size: 512 | lm loss: 2.198796E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.530 | TFLOPs: 57.57 | 7: iteration 10090/ 44073 | consumed samples: 5166080 | consumed tokens: 10580131840 | elapsed time per iteration (s): 4.16 | learning rate: 1.791E-04 | global batch size: 512 | lm loss: 2.174205E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.088 | TFLOPs: 57.37 | 7: iteration 10100/ 44073 | consumed samples: 5171200 | consumed tokens: 10590617600 | elapsed time per iteration (s): 4.14 | learning rate: 1.791E-04 | global batch size: 512 | lm loss: 2.179371E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.531 | TFLOPs: 57.57 | 7: iteration 10110/ 44073 | consumed samples: 5176320 | consumed tokens: 10601103360 | elapsed time per iteration (s): 4.13 | learning rate: 1.791E-04 | global batch size: 512 | lm loss: 2.143585E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.822 | TFLOPs: 57.71 | 7: iteration 10120/ 44073 | consumed samples: 
5181440 | consumed tokens: 10611589120 | elapsed time per iteration (s): 4.17 | learning rate: 1.790E-04 | global batch size: 512 | lm loss: 2.170880E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.928 | TFLOPs: 57.29 | 7: iteration 10130/ 44073 | consumed samples: 5186560 | consumed tokens: 10622074880 | elapsed time per iteration (s): 4.14 | learning rate: 1.790E-04 | global batch size: 512 | lm loss: 2.162410E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.700 | TFLOPs: 57.65 | 7: iteration 10140/ 44073 | consumed samples: 5191680 | consumed tokens: 10632560640 | elapsed time per iteration (s): 4.15 | learning rate: 1.789E-04 | global batch size: 512 | lm loss: 2.173647E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.358 | TFLOPs: 57.49 | 7: iteration 10150/ 44073 | consumed samples: 5196800 | consumed tokens: 10643046400 | elapsed time per iteration (s): 4.16 | learning rate: 1.789E-04 | global batch size: 512 | lm loss: 2.182557E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.070 | TFLOPs: 57.36 | 7: iteration 10160/ 44073 | consumed samples: 5201920 | consumed tokens: 10653532160 | elapsed time per iteration (s): 4.17 | learning rate: 1.788E-04 | global batch size: 512 | lm loss: 2.166875E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.759 | TFLOPs: 57.21 | 7: iteration 10170/ 44073 | consumed samples: 5207040 | consumed tokens: 10664017920 | elapsed time per iteration (s): 4.15 | learning rate: 1.788E-04 | global batch size: 512 | lm loss: 2.177599E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 10180/ 44073 | consumed samples: 5212160 | consumed tokens: 10674503680 | elapsed time per iteration (s): 4.14 | learning rate: 1.788E-04 | global batch size: 512 | lm loss: 2.156759E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.581 | TFLOPs: 57.59 | 7: iteration 10190/ 44073 | consumed samples: 5217280 | consumed tokens: 10684989440 | elapsed time per iteration (s): 4.15 | learning rate: 1.787E-04 | global batch size: 512 | lm loss: 2.163779E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.380 | TFLOPs: 57.50 | 7: iteration 10200/ 44073 | consumed samples: 5222400 | consumed tokens: 10695475200 | elapsed time per iteration (s): 4.15 | learning rate: 1.787E-04 | global batch size: 512 | lm loss: 2.173083E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.489 | TFLOPs: 57.55 | 7: iteration 10210/ 44073 | consumed samples: 5227520 | consumed tokens: 10705960960 | elapsed time per iteration (s): 4.14 | learning rate: 1.786E-04 | global batch size: 512 | lm loss: 2.170687E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.564 | TFLOPs: 57.59 | 7: iteration 10220/ 44073 | consumed samples: 5232640 | consumed tokens: 10716446720 | elapsed time per iteration (s): 4.15 | learning rate: 1.786E-04 | global batch size: 512 | lm loss: 2.177757E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.480 | TFLOPs: 57.55 | 7: iteration 10230/ 44073 | consumed samples: 5237760 | consumed tokens: 10726932480 | elapsed time per iteration (s): 4.14 | learning rate: 1.786E-04 | global batch size: 512 | lm loss: 2.165567E+00 | grad norm: 
0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.757 | TFLOPs: 57.68 | 7: iteration 10240/ 44073 | consumed samples: 5242880 | consumed tokens: 10737418240 | elapsed time per iteration (s): 4.15 | learning rate: 1.785E-04 | global batch size: 512 | lm loss: 2.170887E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.368 | TFLOPs: 57.50 | 7: iteration 10250/ 44073 | consumed samples: 5248000 | consumed tokens: 10747904000 | elapsed time per iteration (s): 4.15 | learning rate: 1.785E-04 | global batch size: 512 | lm loss: 2.175345E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.328 | TFLOPs: 57.48 | 7: iteration 10260/ 44073 | consumed samples: 5253120 | consumed tokens: 10758389760 | elapsed time per iteration (s): 4.16 | learning rate: 1.784E-04 | global batch size: 512 | lm loss: 2.193838E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.046 | TFLOPs: 57.35 | 7: iteration 10270/ 44073 | consumed samples: 5258240 | consumed tokens: 10768875520 | elapsed time per iteration (s): 4.17 | learning rate: 1.784E-04 | global batch size: 512 | lm loss: 2.167581E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.875 | TFLOPs: 57.27 | 7: iteration 10280/ 44073 | consumed samples: 5263360 | consumed tokens: 10779361280 | elapsed time per iteration (s): 4.14 | learning rate: 1.783E-04 | global batch size: 512 | lm loss: 2.177416E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.640 | TFLOPs: 57.62 | 7: iteration 10290/ 44073 | consumed samples: 5268480 | consumed tokens: 10789847040 | elapsed time per iteration (s): 4.14 
| learning rate: 1.783E-04 | global batch size: 512 | lm loss: 2.151982E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.733 | TFLOPs: 57.67 | 7: iteration 10300/ 44073 | consumed samples: 5273600 | consumed tokens: 10800332800 | elapsed time per iteration (s): 4.16 | learning rate: 1.783E-04 | global batch size: 512 | lm loss: 2.161882E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.144 | TFLOPs: 57.39 | 7: iteration 10310/ 44073 | consumed samples: 5278720 | consumed tokens: 10810818560 | elapsed time per iteration (s): 4.16 | learning rate: 1.782E-04 | global batch size: 512 | lm loss: 2.151713E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.066 | TFLOPs: 57.35 | 7: iteration 10320/ 44073 | consumed samples: 5283840 | consumed tokens: 10821304320 | elapsed time per iteration (s): 4.16 | learning rate: 1.782E-04 | global batch size: 512 | lm loss: 2.185444E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.155 | TFLOPs: 57.40 | 7: iteration 10330/ 44073 | consumed samples: 5288960 | consumed tokens: 10831790080 | elapsed time per iteration (s): 4.14 | learning rate: 1.781E-04 | global batch size: 512 | lm loss: 2.181651E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.612 | TFLOPs: 57.61 | 7: iteration 10340/ 44073 | consumed samples: 5294080 | consumed tokens: 10842275840 | elapsed time per iteration (s): 4.16 | learning rate: 1.781E-04 | global batch size: 512 | lm loss: 2.162978E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 | 7: iteration 10350/ 44073 | 
consumed samples: 5299200 | consumed tokens: 10852761600 | elapsed time per iteration (s): 4.14 | learning rate: 1.780E-04 | global batch size: 512 | lm loss: 2.170694E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.648 | TFLOPs: 57.63 | 7: iteration 10360/ 44073 | consumed samples: 5304320 | consumed tokens: 10863247360 | elapsed time per iteration (s): 4.14 | learning rate: 1.780E-04 | global batch size: 512 | lm loss: 2.152491E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.647 | TFLOPs: 57.63 | 7: iteration 10370/ 44073 | consumed samples: 5309440 | consumed tokens: 10873733120 | elapsed time per iteration (s): 4.13 | learning rate: 1.780E-04 | global batch size: 512 | lm loss: 2.162312E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.826 | TFLOPs: 57.71 | 7: iteration 10380/ 44073 | consumed samples: 5314560 | consumed tokens: 10884218880 | elapsed time per iteration (s): 4.21 | learning rate: 1.779E-04 | global batch size: 512 | lm loss: 2.172129E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.631 | TFLOPs: 56.69 | 7: iteration 10390/ 44073 | consumed samples: 5319680 | consumed tokens: 10894704640 | elapsed time per iteration (s): 4.19 | learning rate: 1.779E-04 | global batch size: 512 | lm loss: 2.166260E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.318 | TFLOPs: 57.01 | 7: iteration 10400/ 44073 | consumed samples: 5324800 | consumed tokens: 10905190400 | elapsed time per iteration (s): 4.15 | learning rate: 1.778E-04 | global batch size: 512 | lm loss: 2.181210E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.287 | TFLOPs: 57.46 | 7: iteration 10410/ 44073 | consumed samples: 5329920 | consumed tokens: 10915676160 | elapsed time per iteration (s): 4.19 | learning rate: 1.778E-04 | global batch size: 512 | lm loss: 2.167682E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.230 | TFLOPs: 56.97 | 7: iteration 10420/ 44073 | consumed samples: 5335040 | consumed tokens: 10926161920 | elapsed time per iteration (s): 4.16 | learning rate: 1.778E-04 | global batch size: 512 | lm loss: 2.147324E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.980 | TFLOPs: 57.31 | 7: iteration 10430/ 44073 | consumed samples: 5340160 | consumed tokens: 10936647680 | elapsed time per iteration (s): 4.18 | learning rate: 1.777E-04 | global batch size: 512 | lm loss: 2.171158E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.587 | TFLOPs: 57.13 | 7: iteration 10440/ 44073 | consumed samples: 5345280 | consumed tokens: 10947133440 | elapsed time per iteration (s): 4.14 | learning rate: 1.777E-04 | global batch size: 512 | lm loss: 2.178444E+00 | grad norm: 0.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.652 | TFLOPs: 57.63 | 7: iteration 10450/ 44073 | consumed samples: 5350400 | consumed tokens: 10957619200 | elapsed time per iteration (s): 4.17 | learning rate: 1.776E-04 | global batch size: 512 | lm loss: 2.158793E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.715 | TFLOPs: 57.19 | 7: iteration 10460/ 44073 | consumed samples: 5355520 | consumed tokens: 10968104960 | elapsed time per iteration (s): 4.14 | learning rate: 1.776E-04 | global batch size: 512 | lm loss: 
2.160744E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.555 | TFLOPs: 57.58 | 7: iteration 10470/ 44073 | consumed samples: 5360640 | consumed tokens: 10978590720 | elapsed time per iteration (s): 4.17 | learning rate: 1.775E-04 | global batch size: 512 | lm loss: 2.155111E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.895 | TFLOPs: 57.28 | 7: iteration 10480/ 44073 | consumed samples: 5365760 | consumed tokens: 10989076480 | elapsed time per iteration (s): 4.17 | learning rate: 1.775E-04 | global batch size: 512 | lm loss: 2.158075E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.826 | TFLOPs: 57.24 | 7: iteration 10490/ 44073 | consumed samples: 5370880 | consumed tokens: 10999562240 | elapsed time per iteration (s): 4.15 | learning rate: 1.775E-04 | global batch size: 512 | lm loss: 2.148519E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.349 | TFLOPs: 57.49 | 7: iteration 10500/ 44073 | consumed samples: 5376000 | consumed tokens: 11010048000 | elapsed time per iteration (s): 4.15 | learning rate: 1.774E-04 | global batch size: 512 | lm loss: 2.172086E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.269 | TFLOPs: 57.45 | 7: iteration 10510/ 44073 | consumed samples: 5381120 | consumed tokens: 11020533760 | elapsed time per iteration (s): 4.14 | learning rate: 1.774E-04 | global batch size: 512 | lm loss: 2.180704E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.778 | TFLOPs: 57.69 | 7: iteration 10520/ 44073 | consumed samples: 5386240 | consumed tokens: 11031019520 | elapsed 
time per iteration (s): 4.16 | learning rate: 1.773E-04 | global batch size: 512 | lm loss: 2.172341E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.042 | TFLOPs: 57.34 | 7: iteration 10530/ 44073 | consumed samples: 5391360 | consumed tokens: 11041505280 | elapsed time per iteration (s): 4.15 | learning rate: 1.773E-04 | global batch size: 512 | lm loss: 2.184009E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.368 | TFLOPs: 57.50 | 7: iteration 10540/ 44073 | consumed samples: 5396480 | consumed tokens: 11051991040 | elapsed time per iteration (s): 4.19 | learning rate: 1.772E-04 | global batch size: 512 | lm loss: 2.152116E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.054 | TFLOPs: 56.88 | 7: iteration 10550/ 44073 | consumed samples: 5401600 | consumed tokens: 11062476800 | elapsed time per iteration (s): 4.16 | learning rate: 1.772E-04 | global batch size: 512 | lm loss: 2.153946E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.163 | TFLOPs: 57.40 | 7: iteration 10560/ 44073 | consumed samples: 5406720 | consumed tokens: 11072962560 | elapsed time per iteration (s): 4.16 | learning rate: 1.772E-04 | global batch size: 512 | lm loss: 2.189332E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.025 | TFLOPs: 57.34 | 7: iteration 10570/ 44073 | consumed samples: 5411840 | consumed tokens: 11083448320 | elapsed time per iteration (s): 4.14 | learning rate: 1.771E-04 | global batch size: 512 | lm loss: 2.159281E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.688 | TFLOPs: 57.64 | 7: 
iteration 10580/ 44073 | consumed samples: 5416960 | consumed tokens: 11093934080 | elapsed time per iteration (s): 4.14 | learning rate: 1.771E-04 | global batch size: 512 | lm loss: 2.177719E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.530 | TFLOPs: 57.57 | 7: iteration 10590/ 44073 | consumed samples: 5422080 | consumed tokens: 11104419840 | elapsed time per iteration (s): 4.14 | learning rate: 1.770E-04 | global batch size: 512 | lm loss: 2.151105E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.710 | TFLOPs: 57.66 | 7: iteration 10600/ 44073 | consumed samples: 5427200 | consumed tokens: 11114905600 | elapsed time per iteration (s): 4.15 | learning rate: 1.770E-04 | global batch size: 512 | lm loss: 2.160783E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.421 | TFLOPs: 57.52 | 7: iteration 10610/ 44073 | consumed samples: 5432320 | consumed tokens: 11125391360 | elapsed time per iteration (s): 4.14 | learning rate: 1.769E-04 | global batch size: 512 | lm loss: 2.153739E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 10620/ 44073 | consumed samples: 5437440 | consumed tokens: 11135877120 | elapsed time per iteration (s): 4.14 | learning rate: 1.769E-04 | global batch size: 512 | lm loss: 2.136092E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.745 | TFLOPs: 57.67 | 7: iteration 10630/ 44073 | consumed samples: 5442560 | consumed tokens: 11146362880 | elapsed time per iteration (s): 4.14 | learning rate: 1.768E-04 | global batch size: 512 | lm loss: 2.180594E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.687 | TFLOPs: 57.64 | 7: iteration 10640/ 44073 | consumed samples: 5447680 | consumed tokens: 11156848640 | elapsed time per iteration (s): 4.15 | learning rate: 1.768E-04 | global batch size: 512 | lm loss: 2.153495E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.470 | TFLOPs: 57.54 | 7: iteration 10650/ 44073 | consumed samples: 5452800 | consumed tokens: 11167334400 | elapsed time per iteration (s): 4.14 | learning rate: 1.768E-04 | global batch size: 512 | lm loss: 2.181746E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.567 | TFLOPs: 57.59 | 7: iteration 10660/ 44073 | consumed samples: 5457920 | consumed tokens: 11177820160 | elapsed time per iteration (s): 4.16 | learning rate: 1.767E-04 | global batch size: 512 | lm loss: 2.161441E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.934 | TFLOPs: 57.29 | 7: iteration 10670/ 44073 | consumed samples: 5463040 | consumed tokens: 11188305920 | elapsed time per iteration (s): 4.15 | learning rate: 1.767E-04 | global batch size: 512 | lm loss: 2.153055E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.241 | TFLOPs: 57.44 | 7: iteration 10680/ 44073 | consumed samples: 5468160 | consumed tokens: 11198791680 | elapsed time per iteration (s): 4.15 | learning rate: 1.766E-04 | global batch size: 512 | lm loss: 2.181956E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.320 | TFLOPs: 57.47 | 7: iteration 10690/ 44073 | consumed samples: 5473280 | consumed tokens: 11209277440 | elapsed time per iteration (s): 4.15 | learning rate: 1.766E-04 | global batch 
size: 512 | lm loss: 2.159430E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.326 | TFLOPs: 57.48 | 7: iteration 10700/ 44073 | consumed samples: 5478400 | consumed tokens: 11219763200 | elapsed time per iteration (s): 4.19 | learning rate: 1.765E-04 | global batch size: 512 | lm loss: 2.164393E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.059 | TFLOPs: 56.89 | 7: iteration 10710/ 44073 | consumed samples: 5483520 | consumed tokens: 11230248960 | elapsed time per iteration (s): 4.18 | learning rate: 1.765E-04 | global batch size: 512 | lm loss: 2.172927E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.428 | TFLOPs: 57.06 | 7: iteration 10720/ 44073 | consumed samples: 5488640 | consumed tokens: 11240734720 | elapsed time per iteration (s): 4.15 | learning rate: 1.765E-04 | global batch size: 512 | lm loss: 2.185637E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.414 | TFLOPs: 57.52 | 7: iteration 10730/ 44073 | consumed samples: 5493760 | consumed tokens: 11251220480 | elapsed time per iteration (s): 4.16 | learning rate: 1.764E-04 | global batch size: 512 | lm loss: 2.159803E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.188 | TFLOPs: 57.41 | 7: iteration 10740/ 44073 | consumed samples: 5498880 | consumed tokens: 11261706240 | elapsed time per iteration (s): 4.15 | learning rate: 1.764E-04 | global batch size: 512 | lm loss: 2.154336E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.345 | TFLOPs: 57.49 | 7: iteration 10750/ 44073 | consumed samples: 5504000 | consumed tokens: 
11272192000 | elapsed time per iteration (s): 4.15 | learning rate: 1.763E-04 | global batch size: 512 | lm loss: 2.174981E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.421 | TFLOPs: 57.52 | 7: iteration 10760/ 44073 | consumed samples: 5509120 | consumed tokens: 11282677760 | elapsed time per iteration (s): 4.23 | learning rate: 1.763E-04 | global batch size: 512 | lm loss: 2.150760E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.098 | TFLOPs: 56.44 | 7: iteration 10770/ 44073 | consumed samples: 5514240 | consumed tokens: 11293163520 | elapsed time per iteration (s): 4.16 | learning rate: 1.762E-04 | global batch size: 512 | lm loss: 2.129581E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.953 | TFLOPs: 57.30 | 7: iteration 10780/ 44073 | consumed samples: 5519360 | consumed tokens: 11303649280 | elapsed time per iteration (s): 4.15 | learning rate: 1.762E-04 | global batch size: 512 | lm loss: 2.161972E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.348 | TFLOPs: 57.49 | 7: iteration 10790/ 44073 | consumed samples: 5524480 | consumed tokens: 11314135040 | elapsed time per iteration (s): 4.15 | learning rate: 1.761E-04 | global batch size: 512 | lm loss: 2.155340E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.425 | TFLOPs: 57.52 | 7: iteration 10800/ 44073 | consumed samples: 5529600 | consumed tokens: 11324620800 | elapsed time per iteration (s): 4.15 | learning rate: 1.761E-04 | global batch size: 512 | lm loss: 2.148139E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.342 | 
TFLOPs: 57.48 | 7: iteration 10810/ 44073 | consumed samples: 5534720 | consumed tokens: 11335106560 | elapsed time per iteration (s): 4.15 | learning rate: 1.761E-04 | global batch size: 512 | lm loss: 2.180167E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.237 | TFLOPs: 57.43 | 7: iteration 10820/ 44073 | consumed samples: 5539840 | consumed tokens: 11345592320 | elapsed time per iteration (s): 4.14 | learning rate: 1.760E-04 | global batch size: 512 | lm loss: 2.180002E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 57.61 | 7: iteration 10830/ 44073 | consumed samples: 5544960 | consumed tokens: 11356078080 | elapsed time per iteration (s): 6.65 | learning rate: 1.760E-04 | global batch size: 512 | lm loss: 2.156743E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 76.998 | TFLOPs: 35.89 | 7: iteration 10840/ 44073 | consumed samples: 5550080 | consumed tokens: 11366563840 | elapsed time per iteration (s): 12.45 | learning rate: 1.759E-04 | global batch size: 512 | lm loss: 2.145880E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 41.129 | TFLOPs: 19.17 | 7: iteration 10850/ 44073 | consumed samples: 5555200 | consumed tokens: 11377049600 | elapsed time per iteration (s): 4.14 | learning rate: 1.759E-04 | global batch size: 512 | lm loss: 2.133140E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.821 | TFLOPs: 57.71 | 7: iteration 10860/ 44073 | consumed samples: 5560320 | consumed tokens: 11387535360 | elapsed time per iteration (s): 5.92 | learning rate: 1.758E-04 | global batch size: 512 | lm loss: 2.149319E+00 | grad norm: 0.142 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 86.465 | TFLOPs: 40.30 | 7: iteration 10870/ 44073 | consumed samples: 5565440 | consumed tokens: 11398021120 | elapsed time per iteration (s): 4.17 | learning rate: 1.758E-04 | global batch size: 512 | lm loss: 2.137370E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.674 | TFLOPs: 57.17 | 7: iteration 10880/ 44073 | consumed samples: 5570560 | consumed tokens: 11408506880 | elapsed time per iteration (s): 4.16 | learning rate: 1.758E-04 | global batch size: 512 | lm loss: 2.160903E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.176 | TFLOPs: 57.41 | 7: iteration 10890/ 44073 | consumed samples: 5575680 | consumed tokens: 11418992640 | elapsed time per iteration (s): 4.15 | learning rate: 1.757E-04 | global batch size: 512 | lm loss: 2.145999E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.248 | TFLOPs: 57.44 | 7: iteration 10900/ 44073 | consumed samples: 5580800 | consumed tokens: 11429478400 | elapsed time per iteration (s): 4.25 | learning rate: 1.757E-04 | global batch size: 512 | lm loss: 2.159895E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.553 | TFLOPs: 56.18 | 7: iteration 10910/ 44073 | consumed samples: 5585920 | consumed tokens: 11439964160 | elapsed time per iteration (s): 4.19 | learning rate: 1.756E-04 | global batch size: 512 | lm loss: 2.152018E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.166 | TFLOPs: 56.94 | 7: iteration 10920/ 44073 | consumed samples: 5591040 | consumed tokens: 11450449920 | elapsed time per iteration (s): 4.15 | learning rate: 1.756E-04 | 
global batch size: 512 | lm loss: 2.165806E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.448 | TFLOPs: 57.53 | 7: iteration 10930/ 44073 | consumed samples: 5596160 | consumed tokens: 11460935680 | elapsed time per iteration (s): 4.15 | learning rate: 1.755E-04 | global batch size: 512 | lm loss: 2.181542E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.408 | TFLOPs: 57.51 | 7: iteration 10940/ 44073 | consumed samples: 5601280 | consumed tokens: 11471421440 | elapsed time per iteration (s): 4.15 | learning rate: 1.755E-04 | global batch size: 512 | lm loss: 2.156600E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.238 | TFLOPs: 57.43 | 7: iteration 10950/ 44073 | consumed samples: 5606400 | consumed tokens: 11481907200 | elapsed time per iteration (s): 4.18 | learning rate: 1.754E-04 | global batch size: 512 | lm loss: 2.154024E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.633 | TFLOPs: 57.15 | 7: iteration 10960/ 44073 | consumed samples: 5611520 | consumed tokens: 11492392960 | elapsed time per iteration (s): 4.14 | learning rate: 1.754E-04 | global batch size: 512 | lm loss: 2.151268E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 | 7: iteration 10970/ 44073 | consumed samples: 5616640 | consumed tokens: 11502878720 | elapsed time per iteration (s): 4.15 | learning rate: 1.754E-04 | global batch size: 512 | lm loss: 2.161683E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.434 | TFLOPs: 57.53 | 7: iteration 10980/ 44073 | consumed samples: 5621760 | consumed 
tokens: 11513364480 | elapsed time per iteration (s): 4.14 | learning rate: 1.753E-04 | global batch size: 512 | lm loss: 2.172953E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.539 | TFLOPs: 57.58 | 7: iteration 10990/ 44073 | consumed samples: 5626880 | consumed tokens: 11523850240 | elapsed time per iteration (s): 4.15 | learning rate: 1.753E-04 | global batch size: 512 | lm loss: 2.156758E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.336 | TFLOPs: 57.48 | 7: iteration 11000/ 44073 | consumed samples: 5632000 | consumed tokens: 11534336000 | elapsed time per iteration (s): 4.15 | learning rate: 1.752E-04 | global batch size: 512 | lm loss: 2.166959E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.391 | TFLOPs: 57.51 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 11000 | lm loss value: 2.122571E+00 | lm loss PPL: 8.352582E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 11000 to checkpoints_2b2 0: [2022-11-25 23:17:29,929] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step11000 is begin to save! 0: [2022-11-25 23:17:29,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:17:30,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:17:30,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 23:17:30,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:17:30,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:17:30,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:17:30,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:17:30,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:17:30,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:17:30,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:17:30,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:17:30,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:17:30,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:17:31,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:17:31,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 23:17:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:17:31,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:17:31,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:17:31,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:17:31,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:17:31,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:17:31,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:17:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:17:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:17:31,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:17:31,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:17:31,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 23:17:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:17:32,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:17:32,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:17:32,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:17:32,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:17:32,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:17:32,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:17:32,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:17:32,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:17:32,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:17:32,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:17:32,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 23:17:32,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:17:32,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:17:33,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:17:33,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:17:33,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:17:33,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:17:33,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:17:33,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:17:33,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:17:33,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:17:33,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:17:33,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 23:17:33,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:17:33,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:17:33,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:17:33,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_29-model_00-model_states.pt... 0: [2022-11-25 23:17:34,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_29-model_00-model_states.pt. 0: [2022-11-25 23:17:34,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:17:34,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:17:34,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_31-model_00-model_states.pt... 0: [2022-11-25 23:17:34,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_31-model_00-model_states.pt. 0: [2022-11-25 23:17:34,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_32-model_00-model_states.pt... 0: [2022-11-25 23:17:34,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_32-model_00-model_states.pt. 0: [2022-11-25 23:17:34,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_33-model_00-model_states.pt... 
0: [2022-11-25 23:17:34,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_33-model_00-model_states.pt. 0: [2022-11-25 23:17:34,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_34-model_00-model_states.pt... 0: [2022-11-25 23:17:34,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_34-model_00-model_states.pt. 0: [2022-11-25 23:17:34,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/layer_36-model_00-model_states.pt... 0: [2022-11-25 23:17:34,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/layer_36-model_00-model_states.pt. 0: [2022-11-25 23:17:34,691] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step11000/mp_rank_00_model_states.pt 0: [2022-11-25 23:17:34,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:17:34,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:34,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:17:35,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:17:35,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 23:17:35,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:17:35,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:17:35,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:17:35,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 23:17:35,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:17:35,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
1: [2022-11-25 23:17:35,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:17:35,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:17:35,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:17:35,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:17:35,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:17:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:17:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:17:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:17:35,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:17:35,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:17:35,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:17:35,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:17:35,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:17:35,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:17:35,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:17:35,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:17:35,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:17:35,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
2: [2022-11-25 23:17:35,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:17:35,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:17:35,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:17:35,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:17:35,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:17:35,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:17:35,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:17:35,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:17:35,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:17:35,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
6: [2022-11-25 23:17:35,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:17:35,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:17:35,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:17:35,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:17:35,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: successfully saved checkpoint at iteration 11000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6002.73 7: iteration 11010/ 44073 | consumed samples: 5637120 | consumed tokens: 11544821760 | elapsed time per iteration (s): 4.89 | learning rate: 1.752E-04 | global batch size: 512 | lm loss: 2.156277E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.693 | TFLOPs: 48.79 | 7: iteration 11020/ 44073 | consumed samples: 5642240 | consumed tokens: 11555307520 | elapsed time per iteration (s): 4.17 | learning rate: 1.751E-04 | global batch size: 512 | lm loss: 2.142059E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.732 | TFLOPs: 57.20 | 7: iteration 11030/ 44073 | consumed samples: 5647360 | consumed tokens: 11565793280 | elapsed time per iteration (s): 4.16 | learning rate: 1.751E-04 | global batch size: 512 | lm loss: 2.148961E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.062 | TFLOPs: 57.35 | 7: iteration 11040/ 44073 | consumed samples: 5652480 | consumed tokens: 11576279040 | elapsed time per iteration (s): 4.15 | learning rate: 1.750E-04 | global batch size: 512 | lm loss: 2.148668E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.375 | TFLOPs: 57.50 | 7: iteration 11050/ 44073 | consumed samples: 5657600 | consumed tokens: 11586764800 | elapsed time per iteration (s): 4.17 | learning rate: 1.750E-04 | global batch size: 512 | lm loss: 2.165154E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.801 | TFLOPs: 57.23 | 7: iteration 11060/ 44073 | consumed samples: 5662720 | consumed tokens: 11597250560 | elapsed time per iteration (s): 4.21 | learning rate: 
1.749E-04 | global batch size: 512 | lm loss: 2.163610E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.670 | TFLOPs: 56.70 | 7: iteration 11070/ 44073 | consumed samples: 5667840 | consumed tokens: 11607736320 | elapsed time per iteration (s): 4.16 | learning rate: 1.749E-04 | global batch size: 512 | lm loss: 2.145510E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.215 | TFLOPs: 57.42 | 7: iteration 11080/ 44073 | consumed samples: 5672960 | consumed tokens: 11618222080 | elapsed time per iteration (s): 4.16 | learning rate: 1.749E-04 | global batch size: 512 | lm loss: 2.160632E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.222 | TFLOPs: 57.43 | 7: iteration 11090/ 44073 | consumed samples: 5678080 | consumed tokens: 11628707840 | elapsed time per iteration (s): 4.18 | learning rate: 1.748E-04 | global batch size: 512 | lm loss: 2.144148E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.598 | TFLOPs: 57.14 | 7: iteration 11100/ 44073 | consumed samples: 5683200 | consumed tokens: 11639193600 | elapsed time per iteration (s): 4.16 | learning rate: 1.748E-04 | global batch size: 512 | lm loss: 2.162282E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.955 | TFLOPs: 57.30 | 7: iteration 11110/ 44073 | consumed samples: 5688320 | consumed tokens: 11649679360 | elapsed time per iteration (s): 4.30 | learning rate: 1.747E-04 | global batch size: 512 | lm loss: 2.157191E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.004 | TFLOPs: 55.46 | 7: iteration 11120/ 44073 | consumed samples: 
5693440 | consumed tokens: 11660165120 | elapsed time per iteration (s): 4.17 | learning rate: 1.747E-04 | global batch size: 512 | lm loss: 2.169714E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.721 | TFLOPs: 57.19 | 7: iteration 11130/ 44073 | consumed samples: 5698560 | consumed tokens: 11670650880 | elapsed time per iteration (s): 4.20 | learning rate: 1.746E-04 | global batch size: 512 | lm loss: 2.137008E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.912 | TFLOPs: 56.82 | 7: iteration 11140/ 44073 | consumed samples: 5703680 | consumed tokens: 11681136640 | elapsed time per iteration (s): 4.22 | learning rate: 1.746E-04 | global batch size: 512 | lm loss: 2.173355E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.346 | TFLOPs: 56.55 | 7: iteration 11150/ 44073 | consumed samples: 5708800 | consumed tokens: 11691622400 | elapsed time per iteration (s): 4.18 | learning rate: 1.745E-04 | global batch size: 512 | lm loss: 2.165344E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.565 | TFLOPs: 57.12 | 7: iteration 11160/ 44073 | consumed samples: 5713920 | consumed tokens: 11702108160 | elapsed time per iteration (s): 4.15 | learning rate: 1.745E-04 | global batch size: 512 | lm loss: 2.142352E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.376 | TFLOPs: 57.50 | 7: iteration 11170/ 44073 | consumed samples: 5719040 | consumed tokens: 11712593920 | elapsed time per iteration (s): 4.16 | learning rate: 1.745E-04 | global batch size: 512 | lm loss: 2.148563E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 123.058 | TFLOPs: 57.35 | 7: iteration 11180/ 44073 | consumed samples: 5724160 | consumed tokens: 11723079680 | elapsed time per iteration (s): 4.14 | learning rate: 1.744E-04 | global batch size: 512 | lm loss: 2.128180E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.670 | TFLOPs: 57.64 | 7: iteration 11190/ 44073 | consumed samples: 5729280 | consumed tokens: 11733565440 | elapsed time per iteration (s): 4.18 | learning rate: 1.744E-04 | global batch size: 512 | lm loss: 2.141606E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.623 | TFLOPs: 57.15 | 7: iteration 11200/ 44073 | consumed samples: 5734400 | consumed tokens: 11744051200 | elapsed time per iteration (s): 4.25 | learning rate: 1.743E-04 | global batch size: 512 | lm loss: 2.122686E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.339 | TFLOPs: 56.08 | 7: iteration 11210/ 44073 | consumed samples: 5739520 | consumed tokens: 11754536960 | elapsed time per iteration (s): 4.19 | learning rate: 1.743E-04 | global batch size: 512 | lm loss: 2.132080E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.156 | TFLOPs: 56.93 | 7: iteration 11220/ 44073 | consumed samples: 5744640 | consumed tokens: 11765022720 | elapsed time per iteration (s): 4.18 | learning rate: 1.742E-04 | global batch size: 512 | lm loss: 2.134455E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.350 | TFLOPs: 57.02 | 7: iteration 11230/ 44073 | consumed samples: 5749760 | consumed tokens: 11775508480 | elapsed time per iteration (s): 4.17 | learning rate: 1.742E-04 | global batch size: 512 | lm loss: 2.163319E+00 | grad norm: 
0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.805 | TFLOPs: 57.23 | 7: iteration 11240/ 44073 | consumed samples: 5754880 | consumed tokens: 11785994240 | elapsed time per iteration (s): 4.17 | learning rate: 1.741E-04 | global batch size: 512 | lm loss: 2.152977E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.908 | TFLOPs: 57.28 | 7: iteration 11250/ 44073 | consumed samples: 5760000 | consumed tokens: 11796480000 | elapsed time per iteration (s): 4.18 | learning rate: 1.741E-04 | global batch size: 512 | lm loss: 2.143407E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.621 | TFLOPs: 57.15 | 7: iteration 11260/ 44073 | consumed samples: 5765120 | consumed tokens: 11806965760 | elapsed time per iteration (s): 4.22 | learning rate: 1.740E-04 | global batch size: 512 | lm loss: 2.163281E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.333 | TFLOPs: 56.55 | 7: iteration 11270/ 44073 | consumed samples: 5770240 | consumed tokens: 11817451520 | elapsed time per iteration (s): 4.17 | learning rate: 1.740E-04 | global batch size: 512 | lm loss: 2.158001E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.690 | TFLOPs: 57.18 | 7: iteration 11280/ 44073 | consumed samples: 5775360 | consumed tokens: 11827937280 | elapsed time per iteration (s): 4.19 | learning rate: 1.740E-04 | global batch size: 512 | lm loss: 2.163710E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.324 | TFLOPs: 57.01 | 7: iteration 11290/ 44073 | consumed samples: 5780480 | consumed tokens: 11838423040 | elapsed time per iteration (s): 4.14 
| learning rate: 1.739E-04 | global batch size: 512 | lm loss: 2.135193E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.730 | TFLOPs: 57.66 | 7: iteration 11300/ 44073 | consumed samples: 5785600 | consumed tokens: 11848908800 | elapsed time per iteration (s): 4.17 | learning rate: 1.739E-04 | global batch size: 512 | lm loss: 2.150329E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.857 | TFLOPs: 57.26 | 7: iteration 11310/ 44073 | consumed samples: 5790720 | consumed tokens: 11859394560 | elapsed time per iteration (s): 4.17 | learning rate: 1.738E-04 | global batch size: 512 | lm loss: 2.135935E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.891 | TFLOPs: 57.27 | 7: iteration 11320/ 44073 | consumed samples: 5795840 | consumed tokens: 11869880320 | elapsed time per iteration (s): 4.19 | learning rate: 1.738E-04 | global batch size: 512 | lm loss: 2.154890E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.318 | TFLOPs: 57.01 | 7: iteration 11330/ 44073 | consumed samples: 5800960 | consumed tokens: 11880366080 | elapsed time per iteration (s): 4.22 | learning rate: 1.737E-04 | global batch size: 512 | lm loss: 2.145544E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.326 | TFLOPs: 56.54 | 7: iteration 11340/ 44073 | consumed samples: 5806080 | consumed tokens: 11890851840 | elapsed time per iteration (s): 4.16 | learning rate: 1.737E-04 | global batch size: 512 | lm loss: 2.148975E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.991 | TFLOPs: 57.32 | 7: iteration 11350/ 44073 | 
consumed samples: 5811200 | consumed tokens: 11901337600 | elapsed time per iteration (s): 4.24 | learning rate: 1.736E-04 | global batch size: 512 | lm loss: 2.159899E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.878 | TFLOPs: 56.34 | 7: iteration 11360/ 44073 | consumed samples: 5816320 | consumed tokens: 11911823360 | elapsed time per iteration (s): 4.17 | learning rate: 1.736E-04 | global batch size: 512 | lm loss: 2.150970E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.691 | TFLOPs: 57.18 | 7: iteration 11370/ 44073 | consumed samples: 5821440 | consumed tokens: 11922309120 | elapsed time per iteration (s): 4.17 | learning rate: 1.735E-04 | global batch size: 512 | lm loss: 2.129742E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.856 | TFLOPs: 57.26 | 7: iteration 11380/ 44073 | consumed samples: 5826560 | consumed tokens: 11932794880 | elapsed time per iteration (s): 4.17 | learning rate: 1.735E-04 | global batch size: 512 | lm loss: 2.150616E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.728 | TFLOPs: 57.20 | 7: iteration 11390/ 44073 | consumed samples: 5831680 | consumed tokens: 11943280640 | elapsed time per iteration (s): 4.21 | learning rate: 1.735E-04 | global batch size: 512 | lm loss: 2.158635E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.668 | TFLOPs: 56.70 | 7: iteration 11400/ 44073 | consumed samples: 5836800 | consumed tokens: 11953766400 | elapsed time per iteration (s): 4.14 | learning rate: 1.734E-04 | global batch size: 512 | lm loss: 2.155320E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.663 | TFLOPs: 57.63 | 7: iteration 11410/ 44073 | consumed samples: 5841920 | consumed tokens: 11964252160 | elapsed time per iteration (s): 4.14 | learning rate: 1.734E-04 | global batch size: 512 | lm loss: 2.151424E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.565 | TFLOPs: 57.59 | 7: iteration 11420/ 44073 | consumed samples: 5847040 | consumed tokens: 11974737920 | elapsed time per iteration (s): 4.14 | learning rate: 1.733E-04 | global batch size: 512 | lm loss: 2.142121E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.778 | TFLOPs: 57.69 | 7: iteration 11430/ 44073 | consumed samples: 5852160 | consumed tokens: 11985223680 | elapsed time per iteration (s): 4.23 | learning rate: 1.733E-04 | global batch size: 512 | lm loss: 2.141827E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.102 | TFLOPs: 56.44 | 7: iteration 11440/ 44073 | consumed samples: 5857280 | consumed tokens: 11995709440 | elapsed time per iteration (s): 4.14 | learning rate: 1.732E-04 | global batch size: 512 | lm loss: 2.164003E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.692 | TFLOPs: 57.65 | 7: iteration 11450/ 44073 | consumed samples: 5862400 | consumed tokens: 12006195200 | elapsed time per iteration (s): 4.16 | learning rate: 1.732E-04 | global batch size: 512 | lm loss: 2.154188E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.960 | TFLOPs: 57.31 | 7: iteration 11460/ 44073 | consumed samples: 5867520 | consumed tokens: 12016680960 | elapsed time per iteration (s): 4.15 | learning rate: 1.731E-04 | global batch size: 512 | lm loss: 
2.147784E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.495 | TFLOPs: 57.55 | 7: iteration 11470/ 44073 | consumed samples: 5872640 | consumed tokens: 12027166720 | elapsed time per iteration (s): 4.18 | learning rate: 1.731E-04 | global batch size: 512 | lm loss: 2.128132E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.601 | TFLOPs: 57.14 | 7: iteration 11480/ 44073 | consumed samples: 5877760 | consumed tokens: 12037652480 | elapsed time per iteration (s): 4.15 | learning rate: 1.730E-04 | global batch size: 512 | lm loss: 2.147141E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.366 | TFLOPs: 57.49 | 7: iteration 11490/ 44073 | consumed samples: 5882880 | consumed tokens: 12048138240 | elapsed time per iteration (s): 4.15 | learning rate: 1.730E-04 | global batch size: 512 | lm loss: 2.125089E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.263 | TFLOPs: 57.45 | 7: iteration 11500/ 44073 | consumed samples: 5888000 | consumed tokens: 12058624000 | elapsed time per iteration (s): 4.16 | learning rate: 1.729E-04 | global batch size: 512 | lm loss: 2.138112E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.932 | TFLOPs: 57.29 | 7: iteration 11510/ 44073 | consumed samples: 5893120 | consumed tokens: 12069109760 | elapsed time per iteration (s): 4.16 | learning rate: 1.729E-04 | global batch size: 512 | lm loss: 2.143347E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.132 | TFLOPs: 57.39 | 7: iteration 11520/ 44073 | consumed samples: 5898240 | consumed tokens: 12079595520 | elapsed 
time per iteration (s): 4.17 | learning rate: 1.729E-04 | global batch size: 512 | lm loss: 2.136989E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.862 | TFLOPs: 57.26 | 7: iteration 11530/ 44073 | consumed samples: 5903360 | consumed tokens: 12090081280 | elapsed time per iteration (s): 4.16 | learning rate: 1.728E-04 | global batch size: 512 | lm loss: 2.140537E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.139 | TFLOPs: 57.39 | 7: iteration 11540/ 44073 | consumed samples: 5908480 | consumed tokens: 12100567040 | elapsed time per iteration (s): 4.17 | learning rate: 1.728E-04 | global batch size: 512 | lm loss: 2.149380E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.867 | TFLOPs: 57.26 | 7: iteration 11550/ 44073 | consumed samples: 5913600 | consumed tokens: 12111052800 | elapsed time per iteration (s): 4.16 | learning rate: 1.727E-04 | global batch size: 512 | lm loss: 2.125007E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.024 | TFLOPs: 57.34 | 7: iteration 11560/ 44073 | consumed samples: 5918720 | consumed tokens: 12121538560 | elapsed time per iteration (s): 4.17 | learning rate: 1.727E-04 | global batch size: 512 | lm loss: 2.123169E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.799 | TFLOPs: 57.23 | 7: iteration 11570/ 44073 | consumed samples: 5923840 | consumed tokens: 12132024320 | elapsed time per iteration (s): 4.16 | learning rate: 1.726E-04 | global batch size: 512 | lm loss: 2.157512E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.981 | TFLOPs: 57.32 | 7: 
iteration 11580/ 44073 | consumed samples: 5928960 | consumed tokens: 12142510080 | elapsed time per iteration (s): 4.17 | learning rate: 1.726E-04 | global batch size: 512 | lm loss: 2.112625E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.832 | TFLOPs: 57.25 | 7: iteration 11590/ 44073 | consumed samples: 5934080 | consumed tokens: 12152995840 | elapsed time per iteration (s): 4.16 | learning rate: 1.725E-04 | global batch size: 512 | lm loss: 2.153668E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.158 | TFLOPs: 57.40 | 7: iteration 11600/ 44073 | consumed samples: 5939200 | consumed tokens: 12163481600 | elapsed time per iteration (s): 4.23 | learning rate: 1.725E-04 | global batch size: 512 | lm loss: 2.157771E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.975 | TFLOPs: 56.38 | 7: iteration 11610/ 44073 | consumed samples: 5944320 | consumed tokens: 12173967360 | elapsed time per iteration (s): 4.16 | learning rate: 1.724E-04 | global batch size: 512 | lm loss: 2.135911E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.209 | TFLOPs: 57.42 | 7: iteration 11620/ 44073 | consumed samples: 5949440 | consumed tokens: 12184453120 | elapsed time per iteration (s): 4.17 | learning rate: 1.724E-04 | global batch size: 512 | lm loss: 2.136910E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.848 | TFLOPs: 57.25 | 7: iteration 11630/ 44073 | consumed samples: 5954560 | consumed tokens: 12194938880 | elapsed time per iteration (s): 4.16 | learning rate: 1.723E-04 | global batch size: 512 | lm loss: 2.146307E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.054 | TFLOPs: 57.35 | 7: iteration 11640/ 44073 | consumed samples: 5959680 | consumed tokens: 12205424640 | elapsed time per iteration (s): 4.18 | learning rate: 1.723E-04 | global batch size: 512 | lm loss: 2.151709E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.556 | TFLOPs: 57.12 | 7: iteration 11650/ 44073 | consumed samples: 5964800 | consumed tokens: 12215910400 | elapsed time per iteration (s): 4.17 | learning rate: 1.722E-04 | global batch size: 512 | lm loss: 2.150293E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.706 | TFLOPs: 57.19 | 7: iteration 11660/ 44073 | consumed samples: 5969920 | consumed tokens: 12226396160 | elapsed time per iteration (s): 4.26 | learning rate: 1.722E-04 | global batch size: 512 | lm loss: 2.148954E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.222 | TFLOPs: 56.03 | 7: iteration 11670/ 44073 | consumed samples: 5975040 | consumed tokens: 12236881920 | elapsed time per iteration (s): 4.26 | learning rate: 1.722E-04 | global batch size: 512 | lm loss: 2.139753E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.231 | TFLOPs: 56.03 | 7: iteration 11680/ 44073 | consumed samples: 5980160 | consumed tokens: 12247367680 | elapsed time per iteration (s): 4.23 | learning rate: 1.721E-04 | global batch size: 512 | lm loss: 2.149770E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.072 | TFLOPs: 56.43 | 7: iteration 11690/ 44073 | consumed samples: 5985280 | consumed tokens: 12257853440 | elapsed time per iteration (s): 4.16 | learning rate: 1.721E-04 | global batch 
size: 512 | lm loss: 2.144083E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.017 | TFLOPs: 57.33 | 7: iteration 11700/ 44073 | consumed samples: 5990400 | consumed tokens: 12268339200 | elapsed time per iteration (s): 4.21 | learning rate: 1.720E-04 | global batch size: 512 | lm loss: 2.127147E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.737 | TFLOPs: 56.74 | 7: iteration 11710/ 44073 | consumed samples: 5995520 | consumed tokens: 12278824960 | elapsed time per iteration (s): 4.19 | learning rate: 1.720E-04 | global batch size: 512 | lm loss: 2.141765E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.130 | TFLOPs: 56.92 | 7: iteration 11720/ 44073 | consumed samples: 6000640 | consumed tokens: 12289310720 | elapsed time per iteration (s): 4.20 | learning rate: 1.719E-04 | global batch size: 512 | lm loss: 2.135936E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.985 | TFLOPs: 56.85 | 7: iteration 11730/ 44073 | consumed samples: 6005760 | consumed tokens: 12299796480 | elapsed time per iteration (s): 4.22 | learning rate: 1.719E-04 | global batch size: 512 | lm loss: 2.126869E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.188 | TFLOPs: 56.48 | 7: iteration 11740/ 44073 | consumed samples: 6010880 | consumed tokens: 12310282240 | elapsed time per iteration (s): 4.33 | learning rate: 1.718E-04 | global batch size: 512 | lm loss: 2.156440E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.263 | TFLOPs: 55.12 | 7: iteration 11750/ 44073 | consumed samples: 6016000 | consumed tokens: 
12320768000 | elapsed time per iteration (s): 4.17 | learning rate: 1.718E-04 | global batch size: 512 | lm loss: 2.147583E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.646 | TFLOPs: 57.16 | 7: iteration 11760/ 44073 | consumed samples: 6021120 | consumed tokens: 12331253760 | elapsed time per iteration (s): 4.15 | learning rate: 1.717E-04 | global batch size: 512 | lm loss: 2.137685E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.373 | TFLOPs: 57.50 | 7: iteration 11770/ 44073 | consumed samples: 6026240 | consumed tokens: 12341739520 | elapsed time per iteration (s): 4.20 | learning rate: 1.717E-04 | global batch size: 512 | lm loss: 2.182375E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.825 | TFLOPs: 56.78 | 7: iteration 11780/ 44073 | consumed samples: 6031360 | consumed tokens: 12352225280 | elapsed time per iteration (s): 4.20 | learning rate: 1.716E-04 | global batch size: 512 | lm loss: 2.169927E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.955 | TFLOPs: 56.84 | 7: iteration 11790/ 44073 | consumed samples: 6036480 | consumed tokens: 12362711040 | elapsed time per iteration (s): 4.20 | learning rate: 1.716E-04 | global batch size: 512 | lm loss: 2.132594E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.855 | TFLOPs: 56.79 | 7: iteration 11800/ 44073 | consumed samples: 6041600 | consumed tokens: 12373196800 | elapsed time per iteration (s): 4.17 | learning rate: 1.715E-04 | global batch size: 512 | lm loss: 2.128273E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.848 | 
TFLOPs: 57.25 | 7: iteration 11810/ 44073 | consumed samples: 6046720 | consumed tokens: 12383682560 | elapsed time per iteration (s): 4.15 | learning rate: 1.715E-04 | global batch size: 512 | lm loss: 2.158874E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.463 | TFLOPs: 57.54 | 7: iteration 11820/ 44073 | consumed samples: 6051840 | consumed tokens: 12394168320 | elapsed time per iteration (s): 4.20 | learning rate: 1.714E-04 | global batch size: 512 | lm loss: 2.128296E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.868 | TFLOPs: 56.80 | 7: iteration 11830/ 44073 | consumed samples: 6056960 | consumed tokens: 12404654080 | elapsed time per iteration (s): 4.20 | learning rate: 1.714E-04 | global batch size: 512 | lm loss: 2.121120E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.046 | TFLOPs: 56.88 | 7: iteration 11840/ 44073 | consumed samples: 6062080 | consumed tokens: 12415139840 | elapsed time per iteration (s): 4.17 | learning rate: 1.714E-04 | global batch size: 512 | lm loss: 2.133089E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.710 | TFLOPs: 57.19 | 7: iteration 11850/ 44073 | consumed samples: 6067200 | consumed tokens: 12425625600 | elapsed time per iteration (s): 4.16 | learning rate: 1.713E-04 | global batch size: 512 | lm loss: 2.146613E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.130 | TFLOPs: 57.38 | 7: iteration 11860/ 44073 | consumed samples: 6072320 | consumed tokens: 12436111360 | elapsed time per iteration (s): 4.20 | learning rate: 1.713E-04 | global batch size: 512 | lm loss: 2.134815E+00 | grad norm: 0.129 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.831 | TFLOPs: 56.78 | 7: iteration 11870/ 44073 | consumed samples: 6077440 | consumed tokens: 12446597120 | elapsed time per iteration (s): 4.17 | learning rate: 1.712E-04 | global batch size: 512 | lm loss: 2.129026E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.828 | TFLOPs: 57.24 | 7: iteration 11880/ 44073 | consumed samples: 6082560 | consumed tokens: 12457082880 | elapsed time per iteration (s): 4.17 | learning rate: 1.712E-04 | global batch size: 512 | lm loss: 2.122154E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.692 | TFLOPs: 57.18 | 7: iteration 11890/ 44073 | consumed samples: 6087680 | consumed tokens: 12467568640 | elapsed time per iteration (s): 4.25 | learning rate: 1.711E-04 | global batch size: 512 | lm loss: 2.156557E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.496 | TFLOPs: 56.16 | 7: iteration 11900/ 44073 | consumed samples: 6092800 | consumed tokens: 12478054400 | elapsed time per iteration (s): 4.21 | learning rate: 1.711E-04 | global batch size: 512 | lm loss: 2.150841E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.589 | TFLOPs: 56.67 | 7: iteration 11910/ 44073 | consumed samples: 6097920 | consumed tokens: 12488540160 | elapsed time per iteration (s): 4.22 | learning rate: 1.710E-04 | global batch size: 512 | lm loss: 2.124428E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.443 | TFLOPs: 56.60 | 7: iteration 11920/ 44073 | consumed samples: 6103040 | consumed tokens: 12499025920 | elapsed time per iteration (s): 4.15 | learning rate: 
1.710E-04 | global batch size: 512 | lm loss: 2.121716E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.383 | TFLOPs: 57.50 | 7: iteration 11930/ 44073 | consumed samples: 6108160 | consumed tokens: 12509511680 | elapsed time per iteration (s): 4.15 | learning rate: 1.709E-04 | global batch size: 512 | lm loss: 2.126886E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.298 | TFLOPs: 57.46 | 7: iteration 11940/ 44073 | consumed samples: 6113280 | consumed tokens: 12519997440 | elapsed time per iteration (s): 4.18 | learning rate: 1.709E-04 | global batch size: 512 | lm loss: 2.148109E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.550 | TFLOPs: 57.11 | 7: iteration 11950/ 44073 | consumed samples: 6118400 | consumed tokens: 12530483200 | elapsed time per iteration (s): 4.17 | learning rate: 1.708E-04 | global batch size: 512 | lm loss: 2.157290E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.850 | TFLOPs: 57.25 | 7: iteration 11960/ 44073 | consumed samples: 6123520 | consumed tokens: 12540968960 | elapsed time per iteration (s): 4.21 | learning rate: 1.708E-04 | global batch size: 512 | lm loss: 2.123225E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.597 | TFLOPs: 56.67 | 7: iteration 11970/ 44073 | consumed samples: 6128640 | consumed tokens: 12551454720 | elapsed time per iteration (s): 4.19 | learning rate: 1.707E-04 | global batch size: 512 | lm loss: 2.141760E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.282 | TFLOPs: 56.99 | 7: iteration 11980/ 44073 | consumed samples: 
6133760 | consumed tokens: 12561940480 | elapsed time per iteration (s): 4.15 | learning rate: 1.707E-04 | global batch size: 512 | lm loss: 2.121438E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.390 | TFLOPs: 57.51 |
7: iteration 11990/ 44073 | consumed samples: 6138880 | consumed tokens: 12572426240 | elapsed time per iteration (s): 4.16 | learning rate: 1.706E-04 | global batch size: 512 | lm loss: 2.144584E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.079 | TFLOPs: 57.36 |
0: [2022-11-26 00:27:16,913] [INFO] [logging.py:68:log_dist] [Rank 0] step=12000, skipped=0, lr=[0.0001705876609820537, 0.0001705876609820537, 0.0001705876609820537], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
7: iteration 12000/ 44073 | consumed samples: 6144000 | consumed tokens: 12582912000 | elapsed time per iteration (s): 4.14 | learning rate: 1.706E-04 | global batch size: 512 | lm loss: 2.137311E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.784 | TFLOPs: 57.69 |
0: steps: 12000 loss: 2.1846 iter time (s): 4.228 samples/sec: 121.090
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 12000 | lm loss value: 2.089871E+00 | lm loss PPL: 8.083872E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 12000 to checkpoints_2b2
0: [2022-11-26 00:27:18,249] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step12000 is begin to save!
0: [2022-11-26 00:27:18,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_01-model_00-model_states.pt...
0: [2022-11-26 00:27:18,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:27:18,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:27:18,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:27:18,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:27:18,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:27:18,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:27:19,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:27:19,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:27:19,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:27:19,164] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:27:19,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_07-model_00-model_states.pt. 0: [2022-11-26 00:27:19,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 00:27:19,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:27:19,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:27:19,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:27:19,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:27:19,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:27:19,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:27:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:27:19,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:27:20,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:27:20,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:27:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_13-model_00-model_states.pt. 0: [2022-11-26 00:27:20,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 00:27:20,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:27:20,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:27:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:27:20,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:27:20,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:27:20,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:27:20,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:27:20,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:27:20,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:27:20,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:27:20,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_19-model_00-model_states.pt. 0: [2022-11-26 00:27:21,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 00:27:21,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:27:21,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:27:21,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:27:21,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:27:21,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:27:21,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:27:21,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:27:21,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:27:21,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:27:21,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:27:21,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_25-model_00-model_states.pt. 0: [2022-11-26 00:27:21,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 00:27:21,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:27:21,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:27:22,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:27:22,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:27:22,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:27:22,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_29-model_00-model_states.pt... 0: [2022-11-26 00:27:22,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_29-model_00-model_states.pt. 0: [2022-11-26 00:27:22,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:27:22,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:27:22,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_31-model_00-model_states.pt... 0: [2022-11-26 00:27:22,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_31-model_00-model_states.pt. 0: [2022-11-26 00:27:22,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 00:27:22,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_32-model_00-model_states.pt. 0: [2022-11-26 00:27:22,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_33-model_00-model_states.pt... 0: [2022-11-26 00:27:22,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_33-model_00-model_states.pt. 0: [2022-11-26 00:27:22,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_34-model_00-model_states.pt... 0: [2022-11-26 00:27:23,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_34-model_00-model_states.pt. 0: [2022-11-26 00:27:23,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/layer_36-model_00-model_states.pt... 0: [2022-11-26 00:27:23,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/layer_36-model_00-model_states.pt. 0: [2022-11-26 00:27:23,070] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step12000/mp_rank_00_model_states.pt 0: [2022-11-26 00:27:23,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:27:23,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:27:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: [2022-11-26 00:27:23,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:23,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
5: [2022-11-26 00:27:23,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
6: [2022-11-26 00:27:23,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:23,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
5: [2022-11-26 00:27:23,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
3: [2022-11-26 00:27:23,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:27:23,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:27:23,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:27:23,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:23,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:27:23,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-26 00:27:23,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:27:23,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:27:23,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:27:23,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-26 00:27:23,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:27:23,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:27:23,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:23,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:23,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:27:23,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:23,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:27:23,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-26 00:27:23,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:27:23,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 00:27:23,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-26 00:27:23,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:27:24,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:27:24,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:27:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:27:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-26 00:27:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-26 00:27:24,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:27:24,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-26 00:27:24,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:27:24,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-26 00:27:24,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now!
0: successfully saved checkpoint at iteration 12000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6146.21
7: iteration 12010/ 44073 | consumed samples: 6149120 | consumed tokens: 12593397760 | elapsed time per iteration (s): 4.92 | learning rate: 1.705E-04 | global batch size: 512 | lm loss: 2.125362E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.107 | TFLOPs: 48.52 |
7: iteration 12020/ 44073 | consumed samples: 6154240 | consumed tokens: 12603883520 | elapsed time per iteration (s): 4.17 | learning rate: 1.705E-04 | global batch size: 512 | lm loss: 2.139848E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.824 | TFLOPs: 57.24 |
7: iteration 12030/ 44073 | consumed samples: 6159360 | consumed tokens: 12614369280 | elapsed time per iteration (s): 4.20 | learning rate: 1.704E-04 | global batch size: 512 | lm loss: 2.142126E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.973 | TFLOPs: 56.85 |
7: iteration 12040/ 44073 | consumed samples: 6164480 | consumed tokens: 12624855040 | elapsed time per iteration (s): 4.27 | learning rate: 1.704E-04 | global batch size: 512 | lm loss: 2.137574E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.836 | TFLOPs: 55.85 |
7: iteration 12050/ 44073 | consumed samples: 6169600 | consumed tokens: 12635340800 | elapsed time per iteration (s): 4.17 | learning rate: 1.703E-04 | global batch size: 512 | lm loss:
2.117420E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.745 | TFLOPs: 57.21 | 7: iteration 12060/ 44073 | consumed samples: 6174720 | consumed tokens: 12645826560 | elapsed time per iteration (s): 4.19 | learning rate: 1.703E-04 | global batch size: 512 | lm loss: 2.144300E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.285 | TFLOPs: 56.99 | 7: iteration 12070/ 44073 | consumed samples: 6179840 | consumed tokens: 12656312320 | elapsed time per iteration (s): 4.15 | learning rate: 1.703E-04 | global batch size: 512 | lm loss: 2.123845E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.511 | TFLOPs: 57.56 | 7: iteration 12080/ 44073 | consumed samples: 6184960 | consumed tokens: 12666798080 | elapsed time per iteration (s): 4.16 | learning rate: 1.702E-04 | global batch size: 512 | lm loss: 2.139352E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.222 | TFLOPs: 57.43 | 7: iteration 12090/ 44073 | consumed samples: 6190080 | consumed tokens: 12677283840 | elapsed time per iteration (s): 4.23 | learning rate: 1.702E-04 | global batch size: 512 | lm loss: 2.133631E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.988 | TFLOPs: 56.39 | 7: iteration 12100/ 44073 | consumed samples: 6195200 | consumed tokens: 12687769600 | elapsed time per iteration (s): 4.19 | learning rate: 1.701E-04 | global batch size: 512 | lm loss: 2.137860E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.207 | TFLOPs: 56.95 | 7: iteration 12110/ 44073 | consumed samples: 6200320 | consumed tokens: 12698255360 | elapsed 
time per iteration (s): 4.15 | learning rate: 1.701E-04 | global batch size: 512 | lm loss: 2.119190E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.396 | TFLOPs: 57.51 | 7: iteration 12120/ 44073 | consumed samples: 6205440 | consumed tokens: 12708741120 | elapsed time per iteration (s): 4.22 | learning rate: 1.700E-04 | global batch size: 512 | lm loss: 2.128143E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.470 | TFLOPs: 56.61 | 7: iteration 12130/ 44073 | consumed samples: 6210560 | consumed tokens: 12719226880 | elapsed time per iteration (s): 4.15 | learning rate: 1.700E-04 | global batch size: 512 | lm loss: 2.133202E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.504 | TFLOPs: 57.56 | 7: iteration 12140/ 44073 | consumed samples: 6215680 | consumed tokens: 12729712640 | elapsed time per iteration (s): 4.15 | learning rate: 1.699E-04 | global batch size: 512 | lm loss: 2.143577E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.497 | TFLOPs: 57.56 | 7: iteration 12150/ 44073 | consumed samples: 6220800 | consumed tokens: 12740198400 | elapsed time per iteration (s): 4.14 | learning rate: 1.699E-04 | global batch size: 512 | lm loss: 2.113348E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.620 | TFLOPs: 57.61 | 7: iteration 12160/ 44073 | consumed samples: 6225920 | consumed tokens: 12750684160 | elapsed time per iteration (s): 4.24 | learning rate: 1.698E-04 | global batch size: 512 | lm loss: 2.125687E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.830 | TFLOPs: 56.31 | 7: 
iteration 12170/ 44073 | consumed samples: 6231040 | consumed tokens: 12761169920 | elapsed time per iteration (s): 4.15 | learning rate: 1.698E-04 | global batch size: 512 | lm loss: 2.135215E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 12180/ 44073 | consumed samples: 6236160 | consumed tokens: 12771655680 | elapsed time per iteration (s): 4.15 | learning rate: 1.697E-04 | global batch size: 512 | lm loss: 2.130606E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.498 | TFLOPs: 57.56 | 7: iteration 12190/ 44073 | consumed samples: 6241280 | consumed tokens: 12782141440 | elapsed time per iteration (s): 4.18 | learning rate: 1.697E-04 | global batch size: 512 | lm loss: 2.119517E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.572 | TFLOPs: 57.12 | 7: iteration 12200/ 44073 | consumed samples: 6246400 | consumed tokens: 12792627200 | elapsed time per iteration (s): 4.16 | learning rate: 1.696E-04 | global batch size: 512 | lm loss: 2.127279E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.957 | TFLOPs: 57.30 | 7: iteration 12210/ 44073 | consumed samples: 6251520 | consumed tokens: 12803112960 | elapsed time per iteration (s): 4.23 | learning rate: 1.696E-04 | global batch size: 512 | lm loss: 2.142294E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.133 | TFLOPs: 56.45 | 7: iteration 12220/ 44073 | consumed samples: 6256640 | consumed tokens: 12813598720 | elapsed time per iteration (s): 4.19 | learning rate: 1.695E-04 | global batch size: 512 | lm loss: 2.141056E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.060 | TFLOPs: 56.89 | 7: iteration 12230/ 44073 | consumed samples: 6261760 | consumed tokens: 12824084480 | elapsed time per iteration (s): 4.16 | learning rate: 1.695E-04 | global batch size: 512 | lm loss: 2.134024E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.223 | TFLOPs: 57.43 | 7: iteration 12240/ 44073 | consumed samples: 6266880 | consumed tokens: 12834570240 | elapsed time per iteration (s): 4.18 | learning rate: 1.694E-04 | global batch size: 512 | lm loss: 2.137669E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.346 | TFLOPs: 57.02 | 7: iteration 12250/ 44073 | consumed samples: 6272000 | consumed tokens: 12845056000 | elapsed time per iteration (s): 4.21 | learning rate: 1.694E-04 | global batch size: 512 | lm loss: 2.140546E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.727 | TFLOPs: 56.73 | 7: iteration 12260/ 44073 | consumed samples: 6277120 | consumed tokens: 12855541760 | elapsed time per iteration (s): 4.16 | learning rate: 1.693E-04 | global batch size: 512 | lm loss: 2.162659E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.202 | TFLOPs: 57.42 | 7: iteration 12270/ 44073 | consumed samples: 6282240 | consumed tokens: 12866027520 | elapsed time per iteration (s): 4.17 | learning rate: 1.693E-04 | global batch size: 512 | lm loss: 2.144271E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.680 | TFLOPs: 57.18 | 7: iteration 12280/ 44073 | consumed samples: 6287360 | consumed tokens: 12876513280 | elapsed time per iteration (s): 4.16 | learning rate: 1.692E-04 | global batch 
size: 512 | lm loss: 2.138440E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.213 | TFLOPs: 57.42 | 7: iteration 12290/ 44073 | consumed samples: 6292480 | consumed tokens: 12886999040 | elapsed time per iteration (s): 4.15 | learning rate: 1.692E-04 | global batch size: 512 | lm loss: 2.157804E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.355 | TFLOPs: 57.49 | 7: iteration 12300/ 44073 | consumed samples: 6297600 | consumed tokens: 12897484800 | elapsed time per iteration (s): 4.19 | learning rate: 1.691E-04 | global batch size: 512 | lm loss: 2.109945E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.224 | TFLOPs: 56.96 | 7: iteration 12310/ 44073 | consumed samples: 6302720 | consumed tokens: 12907970560 | elapsed time per iteration (s): 4.20 | learning rate: 1.691E-04 | global batch size: 512 | lm loss: 2.161035E+00 | grad norm: 3.474 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.786 | TFLOPs: 56.76 | 7: iteration 12320/ 44073 | consumed samples: 6307840 | consumed tokens: 12918456320 | elapsed time per iteration (s): 4.42 | learning rate: 1.690E-04 | global batch size: 512 | lm loss: 6.128876E+00 | grad norm: 17.636 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 115.957 | TFLOPs: 54.04 | 7: iteration 12330/ 44073 | consumed samples: 6312960 | consumed tokens: 12928942080 | elapsed time per iteration (s): 4.15 | learning rate: 1.690E-04 | global batch size: 512 | lm loss: 6.910880E+00 | grad norm: 1.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.247 | TFLOPs: 57.44 | 7: iteration 12340/ 44073 | consumed samples: 6318080 | consumed tokens: 
12939427840 | elapsed time per iteration (s): 4.16 | learning rate: 1.689E-04 | global batch size: 512 | lm loss: 6.126530E+00 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.999 | TFLOPs: 57.32 | 7: iteration 12350/ 44073 | consumed samples: 6323200 | consumed tokens: 12949913600 | elapsed time per iteration (s): 4.16 | learning rate: 1.689E-04 | global batch size: 512 | lm loss: 5.457416E+00 | grad norm: 1.628 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.125 | TFLOPs: 57.38 | 7: iteration 12360/ 44073 | consumed samples: 6328320 | consumed tokens: 12960399360 | elapsed time per iteration (s): 4.18 | learning rate: 1.688E-04 | global batch size: 512 | lm loss: 4.717762E+00 | grad norm: 1.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.598 | TFLOPs: 57.14 | 7: iteration 12370/ 44073 | consumed samples: 6333440 | consumed tokens: 12970885120 | elapsed time per iteration (s): 4.15 | learning rate: 1.688E-04 | global batch size: 512 | lm loss: 3.559653E+00 | grad norm: 1.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.340 | TFLOPs: 57.48 | 7: iteration 12380/ 44073 | consumed samples: 6338560 | consumed tokens: 12981370880 | elapsed time per iteration (s): 4.18 | learning rate: 1.687E-04 | global batch size: 512 | lm loss: 2.925174E+00 | grad norm: 1.111 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.411 | TFLOPs: 57.05 | 7: iteration 12390/ 44073 | consumed samples: 6343680 | consumed tokens: 12991856640 | elapsed time per iteration (s): 4.18 | learning rate: 1.687E-04 | global batch size: 512 | lm loss: 2.601806E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.376 | 
TFLOPs: 57.03 | 7: iteration 12400/ 44073 | consumed samples: 6348800 | consumed tokens: 13002342400 | elapsed time per iteration (s): 4.20 | learning rate: 1.686E-04 | global batch size: 512 | lm loss: 2.412375E+00 | grad norm: 0.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.886 | TFLOPs: 56.80 | 7: iteration 12410/ 44073 | consumed samples: 6353920 | consumed tokens: 13012828160 | elapsed time per iteration (s): 4.21 | learning rate: 1.686E-04 | global batch size: 512 | lm loss: 2.317396E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.545 | TFLOPs: 56.65 | 7: iteration 12420/ 44073 | consumed samples: 6359040 | consumed tokens: 13023313920 | elapsed time per iteration (s): 4.15 | learning rate: 1.685E-04 | global batch size: 512 | lm loss: 2.249638E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.358 | TFLOPs: 57.49 | 7: iteration 12430/ 44073 | consumed samples: 6364160 | consumed tokens: 13033799680 | elapsed time per iteration (s): 4.14 | learning rate: 1.685E-04 | global batch size: 512 | lm loss: 2.210417E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.666 | TFLOPs: 57.63 | 7: iteration 12440/ 44073 | consumed samples: 6369280 | consumed tokens: 13044285440 | elapsed time per iteration (s): 4.15 | learning rate: 1.684E-04 | global batch size: 512 | lm loss: 2.205031E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.376 | TFLOPs: 57.50 | 7: iteration 12450/ 44073 | consumed samples: 6374400 | consumed tokens: 13054771200 | elapsed time per iteration (s): 4.20 | learning rate: 1.684E-04 | global batch size: 512 | lm loss: 2.188615E+00 | grad norm: 0.150 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.819 | TFLOPs: 56.77 | 7: iteration 12460/ 44073 | consumed samples: 6379520 | consumed tokens: 13065256960 | elapsed time per iteration (s): 4.16 | learning rate: 1.684E-04 | global batch size: 512 | lm loss: 2.179720E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.970 | TFLOPs: 57.31 | 7: iteration 12470/ 44073 | consumed samples: 6384640 | consumed tokens: 13075742720 | elapsed time per iteration (s): 4.16 | learning rate: 1.683E-04 | global batch size: 512 | lm loss: 2.180083E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.034 | TFLOPs: 57.34 | 7: iteration 12480/ 44073 | consumed samples: 6389760 | consumed tokens: 13086228480 | elapsed time per iteration (s): 4.16 | learning rate: 1.683E-04 | global batch size: 512 | lm loss: 2.174271E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.998 | TFLOPs: 57.32 | 7: iteration 12490/ 44073 | consumed samples: 6394880 | consumed tokens: 13096714240 | elapsed time per iteration (s): 4.17 | learning rate: 1.682E-04 | global batch size: 512 | lm loss: 2.157028E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.755 | TFLOPs: 57.21 | 7: iteration 12500/ 44073 | consumed samples: 6400000 | consumed tokens: 13107200000 | elapsed time per iteration (s): 4.15 | learning rate: 1.682E-04 | global batch size: 512 | lm loss: 2.171505E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.504 | TFLOPs: 57.56 | 7: iteration 12510/ 44073 | consumed samples: 6405120 | consumed tokens: 13117685760 | elapsed time per iteration (s): 4.14 | learning rate: 
1.681E-04 | global batch size: 512 | lm loss: 2.160176E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.601 | TFLOPs: 57.60 | 7: iteration 12520/ 44073 | consumed samples: 6410240 | consumed tokens: 13128171520 | elapsed time per iteration (s): 4.18 | learning rate: 1.681E-04 | global batch size: 512 | lm loss: 2.171083E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.627 | TFLOPs: 57.15 | 7: iteration 12530/ 44073 | consumed samples: 6415360 | consumed tokens: 13138657280 | elapsed time per iteration (s): 4.14 | learning rate: 1.680E-04 | global batch size: 512 | lm loss: 2.176792E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.612 | TFLOPs: 57.61 | 7: iteration 12540/ 44073 | consumed samples: 6420480 | consumed tokens: 13149143040 | elapsed time per iteration (s): 4.27 | learning rate: 1.680E-04 | global batch size: 512 | lm loss: 2.166960E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.969 | TFLOPs: 55.91 | 7: iteration 12550/ 44073 | consumed samples: 6425600 | consumed tokens: 13159628800 | elapsed time per iteration (s): 4.25 | learning rate: 1.679E-04 | global batch size: 512 | lm loss: 2.165307E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.605 | TFLOPs: 56.21 | 7: iteration 12560/ 44073 | consumed samples: 6430720 | consumed tokens: 13170114560 | elapsed time per iteration (s): 4.16 | learning rate: 1.679E-04 | global batch size: 512 | lm loss: 2.151472E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.123 | TFLOPs: 57.38 | 7: iteration 12570/ 44073 | consumed samples: 
6435840 | consumed tokens: 13180600320 | elapsed time per iteration (s): 4.21 | learning rate: 1.678E-04 | global batch size: 512 | lm loss: 2.155569E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.475 | TFLOPs: 56.61 | 7: iteration 12580/ 44073 | consumed samples: 6440960 | consumed tokens: 13191086080 | elapsed time per iteration (s): 4.15 | learning rate: 1.678E-04 | global batch size: 512 | lm loss: 2.170836E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.283 | TFLOPs: 57.46 | 7: iteration 12590/ 44073 | consumed samples: 6446080 | consumed tokens: 13201571840 | elapsed time per iteration (s): 4.24 | learning rate: 1.677E-04 | global batch size: 512 | lm loss: 2.153774E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.802 | TFLOPs: 56.30 | 7: iteration 12600/ 44073 | consumed samples: 6451200 | consumed tokens: 13212057600 | elapsed time per iteration (s): 4.15 | learning rate: 1.677E-04 | global batch size: 512 | lm loss: 2.162286E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.287 | TFLOPs: 57.46 | 7: iteration 12610/ 44073 | consumed samples: 6456320 | consumed tokens: 13222543360 | elapsed time per iteration (s): 4.20 | learning rate: 1.676E-04 | global batch size: 512 | lm loss: 2.166172E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.818 | TFLOPs: 56.77 | 7: iteration 12620/ 44073 | consumed samples: 6461440 | consumed tokens: 13233029120 | elapsed time per iteration (s): 4.18 | learning rate: 1.676E-04 | global batch size: 512 | lm loss: 2.147195E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 122.474 | TFLOPs: 57.08 | 7: iteration 12630/ 44073 | consumed samples: 6466560 | consumed tokens: 13243514880 | elapsed time per iteration (s): 4.17 | learning rate: 1.675E-04 | global batch size: 512 | lm loss: 2.143822E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.880 | TFLOPs: 57.27 | 7: iteration 12640/ 44073 | consumed samples: 6471680 | consumed tokens: 13254000640 | elapsed time per iteration (s): 4.20 | learning rate: 1.675E-04 | global batch size: 512 | lm loss: 2.169388E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.039 | TFLOPs: 56.88 | 7: iteration 12650/ 44073 | consumed samples: 6476800 | consumed tokens: 13264486400 | elapsed time per iteration (s): 4.18 | learning rate: 1.674E-04 | global batch size: 512 | lm loss: 2.128352E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.580 | TFLOPs: 57.13 | 7: iteration 12660/ 44073 | consumed samples: 6481920 | consumed tokens: 13274972160 | elapsed time per iteration (s): 4.15 | learning rate: 1.674E-04 | global batch size: 512 | lm loss: 2.142352E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.381 | TFLOPs: 57.50 | 7: iteration 12670/ 44073 | consumed samples: 6487040 | consumed tokens: 13285457920 | elapsed time per iteration (s): 4.15 | learning rate: 1.673E-04 | global batch size: 512 | lm loss: 2.147917E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.298 | TFLOPs: 57.46 | 7: iteration 12680/ 44073 | consumed samples: 6492160 | consumed tokens: 13295943680 | elapsed time per iteration (s): 4.14 | learning rate: 1.673E-04 | global batch size: 512 | lm loss: 2.138634E+00 | grad norm: 
0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.595 | TFLOPs: 57.60 | 7: iteration 12690/ 44073 | consumed samples: 6497280 | consumed tokens: 13306429440 | elapsed time per iteration (s): 4.16 | learning rate: 1.672E-04 | global batch size: 512 | lm loss: 2.144033E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.021 | TFLOPs: 57.33 | 7: iteration 12700/ 44073 | consumed samples: 6502400 | consumed tokens: 13316915200 | elapsed time per iteration (s): 4.20 | learning rate: 1.672E-04 | global batch size: 512 | lm loss: 2.166501E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.986 | TFLOPs: 56.85 | 7: iteration 12710/ 44073 | consumed samples: 6507520 | consumed tokens: 13327400960 | elapsed time per iteration (s): 4.21 | learning rate: 1.671E-04 | global batch size: 512 | lm loss: 2.121947E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.494 | TFLOPs: 56.62 | 7: iteration 12720/ 44073 | consumed samples: 6512640 | consumed tokens: 13337886720 | elapsed time per iteration (s): 4.19 | learning rate: 1.671E-04 | global batch size: 512 | lm loss: 2.149329E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.249 | TFLOPs: 56.97 | 7: iteration 12730/ 44073 | consumed samples: 6517760 | consumed tokens: 13348372480 | elapsed time per iteration (s): 4.14 | learning rate: 1.670E-04 | global batch size: 512 | lm loss: 2.151576E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.634 | TFLOPs: 57.62 | 7: iteration 12740/ 44073 | consumed samples: 6522880 | consumed tokens: 13358858240 | elapsed time per iteration (s): 4.16 
| learning rate: 1.670E-04 | global batch size: 512 | lm loss: 2.149953E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.951 | TFLOPs: 57.30 | 7: iteration 12750/ 44073 | consumed samples: 6528000 | consumed tokens: 13369344000 | elapsed time per iteration (s): 4.16 | learning rate: 1.669E-04 | global batch size: 512 | lm loss: 2.129996E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.948 | TFLOPs: 57.30 | 7: iteration 12760/ 44073 | consumed samples: 6533120 | consumed tokens: 13379829760 | elapsed time per iteration (s): 4.17 | learning rate: 1.669E-04 | global batch size: 512 | lm loss: 2.120343E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.807 | TFLOPs: 57.23 | 7: iteration 12770/ 44073 | consumed samples: 6538240 | consumed tokens: 13390315520 | elapsed time per iteration (s): 4.18 | learning rate: 1.668E-04 | global batch size: 512 | lm loss: 2.134990E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.624 | TFLOPs: 57.15 | 7: iteration 12780/ 44073 | consumed samples: 6543360 | consumed tokens: 13400801280 | elapsed time per iteration (s): 4.20 | learning rate: 1.668E-04 | global batch size: 512 | lm loss: 2.140843E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.782 | TFLOPs: 56.76 | 7: iteration 12790/ 44073 | consumed samples: 6548480 | consumed tokens: 13411287040 | elapsed time per iteration (s): 4.17 | learning rate: 1.667E-04 | global batch size: 512 | lm loss: 2.122624E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.652 | TFLOPs: 57.16 | 7: iteration 12800/ 44073 | 
consumed samples: 6553600 | consumed tokens: 13421772800 | elapsed time per iteration (s): 4.16 | learning rate: 1.667E-04 | global batch size: 512 | lm loss: 2.122890E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.999 | TFLOPs: 57.32 | 7: iteration 12810/ 44073 | consumed samples: 6558720 | consumed tokens: 13432258560 | elapsed time per iteration (s): 4.16 | learning rate: 1.666E-04 | global batch size: 512 | lm loss: 2.150447E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.976 | TFLOPs: 57.31 | 7: iteration 12820/ 44073 | consumed samples: 6563840 | consumed tokens: 13442744320 | elapsed time per iteration (s): 4.18 | learning rate: 1.666E-04 | global batch size: 512 | lm loss: 2.147877E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.438 | TFLOPs: 57.06 | 7: iteration 12830/ 44073 | consumed samples: 6568960 | consumed tokens: 13453230080 | elapsed time per iteration (s): 4.18 | learning rate: 1.665E-04 | global batch size: 512 | lm loss: 2.153927E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.593 | TFLOPs: 57.13 | 7: iteration 12840/ 44073 | consumed samples: 6574080 | consumed tokens: 13463715840 | elapsed time per iteration (s): 4.19 | learning rate: 1.665E-04 | global batch size: 512 | lm loss: 2.136248E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.267 | TFLOPs: 56.98 | 7: iteration 12850/ 44073 | consumed samples: 6579200 | consumed tokens: 13474201600 | elapsed time per iteration (s): 4.16 | learning rate: 1.664E-04 | global batch size: 512 | lm loss: 2.145861E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.142 | TFLOPs: 57.39 | 7: iteration 12860/ 44073 | consumed samples: 6584320 | consumed tokens: 13484687360 | elapsed time per iteration (s): 4.16 | learning rate: 1.664E-04 | global batch size: 512 | lm loss: 2.123523E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.960 | TFLOPs: 57.31 | 7: iteration 12870/ 44073 | consumed samples: 6589440 | consumed tokens: 13495173120 | elapsed time per iteration (s): 4.16 | learning rate: 1.663E-04 | global batch size: 512 | lm loss: 2.134019E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.153 | TFLOPs: 57.40 | 7: iteration 12880/ 44073 | consumed samples: 6594560 | consumed tokens: 13505658880 | elapsed time per iteration (s): 4.13 | learning rate: 1.663E-04 | global batch size: 512 | lm loss: 2.151421E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.828 | TFLOPs: 57.71 | 7: iteration 12890/ 44073 | consumed samples: 6599680 | consumed tokens: 13516144640 | elapsed time per iteration (s): 4.17 | learning rate: 1.662E-04 | global batch size: 512 | lm loss: 2.130528E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.837 | TFLOPs: 57.25 | 7: iteration 12900/ 44073 | consumed samples: 6604800 | consumed tokens: 13526630400 | elapsed time per iteration (s): 4.16 | learning rate: 1.662E-04 | global batch size: 512 | lm loss: 2.143211E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.955 | TFLOPs: 57.30 | 7: iteration 12910/ 44073 | consumed samples: 6609920 | consumed tokens: 13537116160 | elapsed time per iteration (s): 4.21 | learning rate: 1.661E-04 | global batch size: 512 | lm loss: 
2.152041E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.702 | TFLOPs: 56.72 | 7: iteration 12920/ 44073 | consumed samples: 6615040 | consumed tokens: 13547601920 | elapsed time per iteration (s): 4.21 | learning rate: 1.660E-04 | global batch size: 512 | lm loss: 2.134017E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.632 | TFLOPs: 56.69 | 7: iteration 12930/ 44073 | consumed samples: 6620160 | consumed tokens: 13558087680 | elapsed time per iteration (s): 4.15 | learning rate: 1.660E-04 | global batch size: 512 | lm loss: 2.148827E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.226 | TFLOPs: 57.43 | 7: iteration 12940/ 44073 | consumed samples: 6625280 | consumed tokens: 13568573440 | elapsed time per iteration (s): 4.35 | learning rate: 1.659E-04 | global batch size: 512 | lm loss: 2.121634E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.582 | TFLOPs: 54.80 | 7: iteration 12950/ 44073 | consumed samples: 6630400 | consumed tokens: 13579059200 | elapsed time per iteration (s): 4.23 | learning rate: 1.659E-04 | global batch size: 512 | lm loss: 2.129581E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.103 | TFLOPs: 56.44 | 7: iteration 12960/ 44073 | consumed samples: 6635520 | consumed tokens: 13589544960 | elapsed time per iteration (s): 4.17 | learning rate: 1.658E-04 | global batch size: 512 | lm loss: 2.131876E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.891 | TFLOPs: 57.27 | 7: iteration 12970/ 44073 | consumed samples: 6640640 | consumed tokens: 13600030720 | elapsed 
time per iteration (s): 4.25 | learning rate: 1.658E-04 | global batch size: 512 | lm loss: 2.130325E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.534 | TFLOPs: 56.18 | 7: iteration 12980/ 44073 | consumed samples: 6645760 | consumed tokens: 13610516480 | elapsed time per iteration (s): 4.14 | learning rate: 1.657E-04 | global batch size: 512 | lm loss: 2.132741E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.771 | TFLOPs: 57.68 | 7: iteration 12990/ 44073 | consumed samples: 6650880 | consumed tokens: 13621002240 | elapsed time per iteration (s): 4.14 | learning rate: 1.657E-04 | global batch size: 512 | lm loss: 2.127444E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.723 | TFLOPs: 57.66 | 7: iteration 13000/ 44073 | consumed samples: 6656000 | consumed tokens: 13631488000 | elapsed time per iteration (s): 4.15 | learning rate: 1.656E-04 | global batch size: 512 | lm loss: 2.147568E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.492 | TFLOPs: 57.55 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 13000 | lm loss value: 2.070980E+00 | lm loss PPL: 7.932594E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 13000 to checkpoints_2b2 0: [2022-11-26 01:37:05,979] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step13000 is begin to save! 0: [2022-11-26 01:37:05,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 01:37:06,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:37:06,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:37:06,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:37:06,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:37:06,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:37:06,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:37:06,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:37:06,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:37:06,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:37:06,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:37:07,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_07-model_00-model_states.pt. 0: [2022-11-26 01:37:07,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 01:37:07,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:37:07,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:37:07,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:37:07,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:37:07,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:37:07,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:37:07,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:37:07,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:37:07,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:37:07,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:37:07,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_13-model_00-model_states.pt. 0: [2022-11-26 01:37:07,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 01:37:07,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:37:07,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:37:08,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:37:08,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:37:08,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:37:08,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:37:08,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:37:08,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:37:08,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:37:08,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:37:08,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_19-model_00-model_states.pt. 0: [2022-11-26 01:37:08,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 01:37:08,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:37:08,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:37:08,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:37:08,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:37:08,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:37:08,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:37:09,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:37:09,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:37:09,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:37:09,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:37:09,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_25-model_00-model_states.pt. 0: [2022-11-26 01:37:09,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 01:37:09,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:37:09,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:37:09,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:37:09,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:37:09,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:37:09,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_29-model_00-model_states.pt... 0: [2022-11-26 01:37:09,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_29-model_00-model_states.pt. 0: [2022-11-26 01:37:09,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:37:09,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:37:09,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_31-model_00-model_states.pt... 0: [2022-11-26 01:37:10,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_31-model_00-model_states.pt. 0: [2022-11-26 01:37:10,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 01:37:10,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_32-model_00-model_states.pt. 0: [2022-11-26 01:37:10,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_33-model_00-model_states.pt... 0: [2022-11-26 01:37:10,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_33-model_00-model_states.pt. 0: [2022-11-26 01:37:10,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_34-model_00-model_states.pt... 0: [2022-11-26 01:37:10,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_34-model_00-model_states.pt. 0: [2022-11-26 01:37:10,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/layer_36-model_00-model_states.pt... 0: [2022-11-26 01:37:10,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/layer_36-model_00-model_states.pt. 0: [2022-11-26 01:37:10,416] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step13000/mp_rank_00_model_states.pt 0: [2022-11-26 01:37:10,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:37:10,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:37:10,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:10,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:10,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:10,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
0: [2022-11-26 01:37:10,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:10,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:10,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-26 01:37:11,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:11,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-26 01:37:11,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:11,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-26 01:37:11,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:11,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-26 01:37:11,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:11,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
0: [2022-11-26 01:37:11,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:37:11,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:37:11,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-26 01:37:11,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
1: [2022-11-26 01:37:11,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
1: [2022-11-26 01:37:11,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-26 01:37:11,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:37:11,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:37:11,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-26 01:37:11,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:37:11,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:37:11,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-26 01:37:11,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:37:11,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-26 01:37:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-26 01:37:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:37:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
6: [2022-11-26 01:37:11,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-26 01:37:11,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:37:11,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:37:11,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:37:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-26 01:37:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
0: successfully saved checkpoint at iteration 13000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5868.29 7: iteration 13010/ 44073 | consumed samples: 6661120 | consumed tokens: 13641973760 | elapsed time per iteration (s): 4.88 | learning rate: 1.656E-04 | global batch size: 512 | lm loss: 2.124317E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.025 | TFLOPs: 48.95 | 7: iteration 13020/ 44073 | consumed samples: 6666240 | consumed tokens: 13652459520 | elapsed time per iteration (s): 4.18 | learning rate: 1.655E-04 | global batch size: 512 | lm loss: 2.136787E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.484 | TFLOPs: 57.08 | 7: iteration 13030/ 44073 | consumed samples: 6671360 | consumed tokens: 13662945280 | elapsed time per iteration (s): 4.17 | learning rate: 1.655E-04 | global batch size: 512 | lm loss: 2.144116E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.777 | TFLOPs: 57.22 | 7: iteration 13040/ 44073 | consumed samples: 6676480 | consumed tokens: 13673431040 | elapsed time per iteration (s): 4.19 | learning rate: 1.654E-04 | global batch size: 512 | lm loss: 2.127124E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.305 | TFLOPs: 57.00 | 7: iteration 13050/ 44073 | consumed samples: 6681600 | consumed tokens: 13683916800 | elapsed time per iteration (s): 4.15 | learning rate: 1.654E-04 | global batch size: 512 | lm loss: 2.123075E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.505 | TFLOPs: 57.56 | 7: iteration 13060/ 44073 | consumed samples: 6686720 | consumed tokens: 13694402560 | elapsed time per iteration (s): 4.16 | learning rate: 
1.653E-04 | global batch size: 512 | lm loss: 2.129682E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.130 | TFLOPs: 57.38 | 7: iteration 13070/ 44073 | consumed samples: 6691840 | consumed tokens: 13704888320 | elapsed time per iteration (s): 4.15 | learning rate: 1.653E-04 | global batch size: 512 | lm loss: 2.120707E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.456 | TFLOPs: 57.54 | 7: iteration 13080/ 44073 | consumed samples: 6696960 | consumed tokens: 13715374080 | elapsed time per iteration (s): 4.16 | learning rate: 1.652E-04 | global batch size: 512 | lm loss: 2.101934E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.000 | TFLOPs: 57.32 | 7: iteration 13090/ 44073 | consumed samples: 6702080 | consumed tokens: 13725859840 | elapsed time per iteration (s): 4.27 | learning rate: 1.652E-04 | global batch size: 512 | lm loss: 2.122003E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.996 | TFLOPs: 55.92 | 7: iteration 13100/ 44073 | consumed samples: 6707200 | consumed tokens: 13736345600 | elapsed time per iteration (s): 4.16 | learning rate: 1.651E-04 | global batch size: 512 | lm loss: 2.128390E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.151 | TFLOPs: 57.39 | 7: iteration 13110/ 44073 | consumed samples: 6712320 | consumed tokens: 13746831360 | elapsed time per iteration (s): 4.18 | learning rate: 1.651E-04 | global batch size: 512 | lm loss: 2.141655E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.509 | TFLOPs: 57.10 | 7: iteration 13120/ 44073 | consumed samples: 
6717440 | consumed tokens: 13757317120 | elapsed time per iteration (s): 4.20 | learning rate: 1.650E-04 | global batch size: 512 | lm loss: 2.112917E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.001 | TFLOPs: 56.86 | 7: iteration 13130/ 44073 | consumed samples: 6722560 | consumed tokens: 13767802880 | elapsed time per iteration (s): 4.25 | learning rate: 1.650E-04 | global batch size: 512 | lm loss: 2.112698E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.603 | TFLOPs: 56.21 | 7: iteration 13140/ 44073 | consumed samples: 6727680 | consumed tokens: 13778288640 | elapsed time per iteration (s): 4.18 | learning rate: 1.649E-04 | global batch size: 512 | lm loss: 2.120119E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.383 | TFLOPs: 57.04 | 7: iteration 13150/ 44073 | consumed samples: 6732800 | consumed tokens: 13788774400 | elapsed time per iteration (s): 4.15 | learning rate: 1.649E-04 | global batch size: 512 | lm loss: 2.134798E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.245 | TFLOPs: 57.44 | 7: iteration 13160/ 44073 | consumed samples: 6737920 | consumed tokens: 13799260160 | elapsed time per iteration (s): 4.19 | learning rate: 1.648E-04 | global batch size: 512 | lm loss: 2.133030E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.073 | TFLOPs: 56.89 | 7: iteration 13170/ 44073 | consumed samples: 6743040 | consumed tokens: 13809745920 | elapsed time per iteration (s): 4.17 | learning rate: 1.648E-04 | global batch size: 512 | lm loss: 2.149528E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 122.801 | TFLOPs: 57.23 | 7: iteration 13180/ 44073 | consumed samples: 6748160 | consumed tokens: 13820231680 | elapsed time per iteration (s): 4.19 | learning rate: 1.647E-04 | global batch size: 512 | lm loss: 2.107996E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.132 | TFLOPs: 56.92 | 7: iteration 13190/ 44073 | consumed samples: 6753280 | consumed tokens: 13830717440 | elapsed time per iteration (s): 4.17 | learning rate: 1.647E-04 | global batch size: 512 | lm loss: 2.135666E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.825 | TFLOPs: 57.24 | 7: iteration 13200/ 44073 | consumed samples: 6758400 | consumed tokens: 13841203200 | elapsed time per iteration (s): 4.32 | learning rate: 1.646E-04 | global batch size: 512 | lm loss: 2.130817E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.650 | TFLOPs: 55.30 | 7: iteration 13210/ 44073 | consumed samples: 6763520 | consumed tokens: 13851688960 | elapsed time per iteration (s): 4.24 | learning rate: 1.646E-04 | global batch size: 512 | lm loss: 2.129906E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.722 | TFLOPs: 56.26 | 7: iteration 13220/ 44073 | consumed samples: 6768640 | consumed tokens: 13862174720 | elapsed time per iteration (s): 4.16 | learning rate: 1.645E-04 | global batch size: 512 | lm loss: 2.128514E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.099 | TFLOPs: 57.37 | 7: iteration 13230/ 44073 | consumed samples: 6773760 | consumed tokens: 13872660480 | elapsed time per iteration (s): 4.17 | learning rate: 1.645E-04 | global batch size: 512 | lm loss: 2.119969E+00 | grad norm: 
0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.764 | TFLOPs: 57.21 | 7: iteration 13240/ 44073 | consumed samples: 6778880 | consumed tokens: 13883146240 | elapsed time per iteration (s): 4.19 | learning rate: 1.644E-04 | global batch size: 512 | lm loss: 2.110741E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.311 | TFLOPs: 57.00 | 7: iteration 13250/ 44073 | consumed samples: 6784000 | consumed tokens: 13893632000 | elapsed time per iteration (s): 4.15 | learning rate: 1.644E-04 | global batch size: 512 | lm loss: 2.115924E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.471 | TFLOPs: 57.54 | 7: iteration 13260/ 44073 | consumed samples: 6789120 | consumed tokens: 13904117760 | elapsed time per iteration (s): 4.14 | learning rate: 1.643E-04 | global batch size: 512 | lm loss: 2.128089E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.588 | TFLOPs: 57.60 | 7: iteration 13270/ 44073 | consumed samples: 6794240 | consumed tokens: 13914603520 | elapsed time per iteration (s): 4.16 | learning rate: 1.643E-04 | global batch size: 512 | lm loss: 2.116674E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.026 | TFLOPs: 57.34 | 7: iteration 13280/ 44073 | consumed samples: 6799360 | consumed tokens: 13925089280 | elapsed time per iteration (s): 4.15 | learning rate: 1.642E-04 | global batch size: 512 | lm loss: 2.128610E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.468 | TFLOPs: 57.54 | 7: iteration 13290/ 44073 | consumed samples: 6804480 | consumed tokens: 13935575040 | elapsed time per iteration (s): 4.20 
| learning rate: 1.642E-04 | global batch size: 512 | lm loss: 2.108096E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.837 | TFLOPs: 56.78 | 7: iteration 13300/ 44073 | consumed samples: 6809600 | consumed tokens: 13946060800 | elapsed time per iteration (s): 4.19 | learning rate: 1.641E-04 | global batch size: 512 | lm loss: 2.113756E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.102 | TFLOPs: 56.91 | 7: iteration 13310/ 44073 | consumed samples: 6814720 | consumed tokens: 13956546560 | elapsed time per iteration (s): 4.16 | learning rate: 1.641E-04 | global batch size: 512 | lm loss: 2.122066E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.134 | TFLOPs: 57.39 | 7: iteration 13320/ 44073 | consumed samples: 6819840 | consumed tokens: 13967032320 | elapsed time per iteration (s): 4.20 | learning rate: 1.640E-04 | global batch size: 512 | lm loss: 2.126364E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.000 | TFLOPs: 56.86 | 7: iteration 13330/ 44073 | consumed samples: 6824960 | consumed tokens: 13977518080 | elapsed time per iteration (s): 4.19 | learning rate: 1.639E-04 | global batch size: 512 | lm loss: 2.104856E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.076 | TFLOPs: 56.89 | 7: iteration 13340/ 44073 | consumed samples: 6830080 | consumed tokens: 13988003840 | elapsed time per iteration (s): 4.16 | learning rate: 1.639E-04 | global batch size: 512 | lm loss: 2.144253E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.043 | TFLOPs: 57.34 | 7: iteration 13350/ 44073 | 
consumed samples: 6835200 | consumed tokens: 13998489600 | elapsed time per iteration (s): 15.35 | learning rate: 1.638E-04 | global batch size: 512 | lm loss: 2.114163E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 33.362 | TFLOPs: 15.55 | 7: iteration 13360/ 44073 | consumed samples: 6840320 | consumed tokens: 14008975360 | elapsed time per iteration (s): 4.16 | learning rate: 1.638E-04 | global batch size: 512 | lm loss: 2.116475E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.057 | TFLOPs: 57.35 | 7: iteration 13370/ 44073 | consumed samples: 6845440 | consumed tokens: 14019461120 | elapsed time per iteration (s): 4.15 | learning rate: 1.637E-04 | global batch size: 512 | lm loss: 2.113115E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.309 | TFLOPs: 57.47 | 7: iteration 13380/ 44073 | consumed samples: 6850560 | consumed tokens: 14029946880 | elapsed time per iteration (s): 4.17 | learning rate: 1.637E-04 | global batch size: 512 | lm loss: 2.120307E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.749 | TFLOPs: 57.21 | 7: iteration 13390/ 44073 | consumed samples: 6855680 | consumed tokens: 14040432640 | elapsed time per iteration (s): 4.20 | learning rate: 1.636E-04 | global batch size: 512 | lm loss: 2.133282E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.774 | TFLOPs: 56.75 | 7: iteration 13400/ 44073 | consumed samples: 6860800 | consumed tokens: 14050918400 | elapsed time per iteration (s): 4.16 | learning rate: 1.636E-04 | global batch size: 512 | lm loss: 2.108926E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.996 | TFLOPs: 57.32 | 7: iteration 13410/ 44073 | consumed samples: 6865920 | consumed tokens: 14061404160 | elapsed time per iteration (s): 4.16 | learning rate: 1.635E-04 | global batch size: 512 | lm loss: 2.103278E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.032 | TFLOPs: 57.34 | 7: iteration 13420/ 44073 | consumed samples: 6871040 | consumed tokens: 14071889920 | elapsed time per iteration (s): 4.19 | learning rate: 1.635E-04 | global batch size: 512 | lm loss: 2.095359E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.203 | TFLOPs: 56.95 | 7: iteration 13430/ 44073 | consumed samples: 6876160 | consumed tokens: 14082375680 | elapsed time per iteration (s): 4.17 | learning rate: 1.634E-04 | global batch size: 512 | lm loss: 2.115252E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.850 | TFLOPs: 57.25 | 7: iteration 13440/ 44073 | consumed samples: 6881280 | consumed tokens: 14092861440 | elapsed time per iteration (s): 4.24 | learning rate: 1.634E-04 | global batch size: 512 | lm loss: 2.114421E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.681 | TFLOPs: 56.24 | 7: iteration 13450/ 44073 | consumed samples: 6886400 | consumed tokens: 14103347200 | elapsed time per iteration (s): 4.16 | learning rate: 1.633E-04 | global batch size: 512 | lm loss: 2.118052E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.930 | TFLOPs: 57.29 | 7: iteration 13460/ 44073 | consumed samples: 6891520 | consumed tokens: 14113832960 | elapsed time per iteration (s): 4.14 | learning rate: 1.633E-04 | global batch size: 512 | lm loss: 
2.108161E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.721 | TFLOPs: 57.66 | 7: iteration 13470/ 44073 | consumed samples: 6896640 | consumed tokens: 14124318720 | elapsed time per iteration (s): 4.16 | learning rate: 1.632E-04 | global batch size: 512 | lm loss: 2.113767E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.056 | TFLOPs: 57.35 | 7: iteration 13480/ 44073 | consumed samples: 6901760 | consumed tokens: 14134804480 | elapsed time per iteration (s): 4.17 | learning rate: 1.632E-04 | global batch size: 512 | lm loss: 2.112270E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.682 | TFLOPs: 57.18 | 7: iteration 13490/ 44073 | consumed samples: 6906880 | consumed tokens: 14145290240 | elapsed time per iteration (s): 4.16 | learning rate: 1.631E-04 | global batch size: 512 | lm loss: 2.109018E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.045 | TFLOPs: 57.35 | 7: iteration 13500/ 44073 | consumed samples: 6912000 | consumed tokens: 14155776000 | elapsed time per iteration (s): 4.29 | learning rate: 1.631E-04 | global batch size: 512 | lm loss: 2.126879E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.424 | TFLOPs: 55.66 | 7: iteration 13510/ 44073 | consumed samples: 6917120 | consumed tokens: 14166261760 | elapsed time per iteration (s): 4.17 | learning rate: 1.630E-04 | global batch size: 512 | lm loss: 2.107373E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.760 | TFLOPs: 57.21 | 7: iteration 13520/ 44073 | consumed samples: 6922240 | consumed tokens: 14176747520 | elapsed 
time per iteration (s): 4.20 | learning rate: 1.630E-04 | global batch size: 512 | lm loss: 2.099628E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.896 | TFLOPs: 56.81 | 7: iteration 13530/ 44073 | consumed samples: 6927360 | consumed tokens: 14187233280 | elapsed time per iteration (s): 4.21 | learning rate: 1.629E-04 | global batch size: 512 | lm loss: 2.138696E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.694 | TFLOPs: 56.72 | 7: iteration 13540/ 44073 | consumed samples: 6932480 | consumed tokens: 14197719040 | elapsed time per iteration (s): 4.20 | learning rate: 1.629E-04 | global batch size: 512 | lm loss: 2.107356E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.811 | TFLOPs: 56.77 | 7: iteration 13550/ 44073 | consumed samples: 6937600 | consumed tokens: 14208204800 | elapsed time per iteration (s): 4.17 | learning rate: 1.628E-04 | global batch size: 512 | lm loss: 2.106524E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.811 | TFLOPs: 57.24 | 7: iteration 13560/ 44073 | consumed samples: 6942720 | consumed tokens: 14218690560 | elapsed time per iteration (s): 4.19 | learning rate: 1.627E-04 | global batch size: 512 | lm loss: 2.117987E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.292 | TFLOPs: 56.99 | 7: iteration 13570/ 44073 | consumed samples: 6947840 | consumed tokens: 14229176320 | elapsed time per iteration (s): 4.15 | learning rate: 1.627E-04 | global batch size: 512 | lm loss: 2.101119E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.298 | TFLOPs: 57.46 | 7: 
iteration 13580/ 44073 | consumed samples: 6952960 | consumed tokens: 14239662080 | elapsed time per iteration (s): 4.18 | learning rate: 1.626E-04 | global batch size: 512 | lm loss: 2.124372E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.563 | TFLOPs: 57.12 | 7: iteration 13590/ 44073 | consumed samples: 6958080 | consumed tokens: 14250147840 | elapsed time per iteration (s): 4.16 | learning rate: 1.626E-04 | global batch size: 512 | lm loss: 2.108760E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.175 | TFLOPs: 57.41 | 7: iteration 13600/ 44073 | consumed samples: 6963200 | consumed tokens: 14260633600 | elapsed time per iteration (s): 4.18 | learning rate: 1.625E-04 | global batch size: 512 | lm loss: 2.117233E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.594 | TFLOPs: 57.13 | 7: iteration 13610/ 44073 | consumed samples: 6968320 | consumed tokens: 14271119360 | elapsed time per iteration (s): 4.17 | learning rate: 1.625E-04 | global batch size: 512 | lm loss: 2.116377E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.788 | TFLOPs: 57.23 | 7: iteration 13620/ 44073 | consumed samples: 6973440 | consumed tokens: 14281605120 | elapsed time per iteration (s): 4.18 | learning rate: 1.624E-04 | global batch size: 512 | lm loss: 2.104666E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.568 | TFLOPs: 57.12 | 7: iteration 13630/ 44073 | consumed samples: 6978560 | consumed tokens: 14292090880 | elapsed time per iteration (s): 4.20 | learning rate: 1.624E-04 | global batch size: 512 | lm loss: 2.103949E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.994 | TFLOPs: 56.86 | 7: iteration 13640/ 44073 | consumed samples: 6983680 | consumed tokens: 14302576640 | elapsed time per iteration (s): 4.18 | learning rate: 1.623E-04 | global batch size: 512 | lm loss: 2.104624E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.617 | TFLOPs: 57.15 | 7: iteration 13650/ 44073 | consumed samples: 6988800 | consumed tokens: 14313062400 | elapsed time per iteration (s): 4.19 | learning rate: 1.623E-04 | global batch size: 512 | lm loss: 2.118201E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.158 | TFLOPs: 56.93 | 7: iteration 13660/ 44073 | consumed samples: 6993920 | consumed tokens: 14323548160 | elapsed time per iteration (s): 4.18 | learning rate: 1.622E-04 | global batch size: 512 | lm loss: 2.120203E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.492 | TFLOPs: 57.09 | 7: iteration 13670/ 44073 | consumed samples: 6999040 | consumed tokens: 14334033920 | elapsed time per iteration (s): 4.16 | learning rate: 1.622E-04 | global batch size: 512 | lm loss: 2.109672E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.006 | TFLOPs: 57.33 | 7: iteration 13680/ 44073 | consumed samples: 7004160 | consumed tokens: 14344519680 | elapsed time per iteration (s): 4.21 | learning rate: 1.621E-04 | global batch size: 512 | lm loss: 2.108608E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.514 | TFLOPs: 56.63 | 7: iteration 13690/ 44073 | consumed samples: 7009280 | consumed tokens: 14355005440 | elapsed time per iteration (s): 4.17 | learning rate: 1.621E-04 | global batch 
size: 512 | lm loss: 2.121804E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.797 | TFLOPs: 57.23 | 7: iteration 13700/ 44073 | consumed samples: 7014400 | consumed tokens: 14365491200 | elapsed time per iteration (s): 4.17 | learning rate: 1.620E-04 | global batch size: 512 | lm loss: 2.107156E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.916 | TFLOPs: 57.29 | 7: iteration 13710/ 44073 | consumed samples: 7019520 | consumed tokens: 14375976960 | elapsed time per iteration (s): 4.15 | learning rate: 1.620E-04 | global batch size: 512 | lm loss: 2.116121E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.491 | TFLOPs: 57.55 | 7: iteration 13720/ 44073 | consumed samples: 7024640 | consumed tokens: 14386462720 | elapsed time per iteration (s): 4.20 | learning rate: 1.619E-04 | global batch size: 512 | lm loss: 2.120885E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.003 | TFLOPs: 56.86 | 7: iteration 13730/ 44073 | consumed samples: 7029760 | consumed tokens: 14396948480 | elapsed time per iteration (s): 4.17 | learning rate: 1.618E-04 | global batch size: 512 | lm loss: 2.119652E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.890 | TFLOPs: 57.27 | 7: iteration 13740/ 44073 | consumed samples: 7034880 | consumed tokens: 14407434240 | elapsed time per iteration (s): 4.20 | learning rate: 1.618E-04 | global batch size: 512 | lm loss: 2.109400E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.841 | TFLOPs: 56.78 | 7: iteration 13750/ 44073 | consumed samples: 7040000 | consumed tokens: 
14417920000 | elapsed time per iteration (s): 4.15 | learning rate: 1.617E-04 | global batch size: 512 | lm loss: 2.125152E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.409 | TFLOPs: 57.51 | 7: iteration 13760/ 44073 | consumed samples: 7045120 | consumed tokens: 14428405760 | elapsed time per iteration (s): 4.18 | learning rate: 1.617E-04 | global batch size: 512 | lm loss: 2.113758E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.572 | TFLOPs: 57.12 | 7: iteration 13770/ 44073 | consumed samples: 7050240 | consumed tokens: 14438891520 | elapsed time per iteration (s): 4.14 | learning rate: 1.616E-04 | global batch size: 512 | lm loss: 2.108109E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.816 | TFLOPs: 57.70 | 7: iteration 13780/ 44073 | consumed samples: 7055360 | consumed tokens: 14449377280 | elapsed time per iteration (s): 4.17 | learning rate: 1.616E-04 | global batch size: 512 | lm loss: 2.109042E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.763 | TFLOPs: 57.21 | 7: iteration 13790/ 44073 | consumed samples: 7060480 | consumed tokens: 14459863040 | elapsed time per iteration (s): 4.15 | learning rate: 1.615E-04 | global batch size: 512 | lm loss: 2.105781E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.250 | TFLOPs: 57.44 | 7: iteration 13800/ 44073 | consumed samples: 7065600 | consumed tokens: 14470348800 | elapsed time per iteration (s): 4.17 | learning rate: 1.615E-04 | global batch size: 512 | lm loss: 2.122485E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.877 | 
TFLOPs: 57.27 | 7: iteration 13810/ 44073 | consumed samples: 7070720 | consumed tokens: 14480834560 | elapsed time per iteration (s): 4.17 | learning rate: 1.614E-04 | global batch size: 512 | lm loss: 2.107761E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.808 | TFLOPs: 57.23 | 7: iteration 13820/ 44073 | consumed samples: 7075840 | consumed tokens: 14491320320 | elapsed time per iteration (s): 4.15 | learning rate: 1.614E-04 | global batch size: 512 | lm loss: 2.119734E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.425 | TFLOPs: 57.52 | 7: iteration 13830/ 44073 | consumed samples: 7080960 | consumed tokens: 14501806080 | elapsed time per iteration (s): 4.18 | learning rate: 1.613E-04 | global batch size: 512 | lm loss: 2.109387E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.515 | TFLOPs: 57.10 | 7: iteration 13840/ 44073 | consumed samples: 7086080 | consumed tokens: 14512291840 | elapsed time per iteration (s): 4.35 | learning rate: 1.613E-04 | global batch size: 512 | lm loss: 2.132994E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.651 | TFLOPs: 54.83 | 7: iteration 13850/ 44073 | consumed samples: 7091200 | consumed tokens: 14522777600 | elapsed time per iteration (s): 4.18 | learning rate: 1.612E-04 | global batch size: 512 | lm loss: 2.120719E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.630 | TFLOPs: 57.15 | 7: iteration 13860/ 44073 | consumed samples: 7096320 | consumed tokens: 14533263360 | elapsed time per iteration (s): 4.17 | learning rate: 1.612E-04 | global batch size: 512 | lm loss: 2.137234E+00 | grad norm: 0.129 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.651 | TFLOPs: 57.16 | 7: iteration 13870/ 44073 | consumed samples: 7101440 | consumed tokens: 14543749120 | elapsed time per iteration (s): 4.17 | learning rate: 1.611E-04 | global batch size: 512 | lm loss: 2.115384E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.660 | TFLOPs: 57.17 | 7: iteration 13880/ 44073 | consumed samples: 7106560 | consumed tokens: 14554234880 | elapsed time per iteration (s): 4.30 | learning rate: 1.611E-04 | global batch size: 512 | lm loss: 2.124673E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.061 | TFLOPs: 55.49 | 7: iteration 13890/ 44073 | consumed samples: 7111680 | consumed tokens: 14564720640 | elapsed time per iteration (s): 4.14 | learning rate: 1.610E-04 | global batch size: 512 | lm loss: 2.117975E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.735 | TFLOPs: 57.67 | 7: iteration 13900/ 44073 | consumed samples: 7116800 | consumed tokens: 14575206400 | elapsed time per iteration (s): 4.16 | learning rate: 1.609E-04 | global batch size: 512 | lm loss: 2.102507E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.005 | TFLOPs: 57.33 | 7: iteration 13910/ 44073 | consumed samples: 7121920 | consumed tokens: 14585692160 | elapsed time per iteration (s): 4.18 | learning rate: 1.609E-04 | global batch size: 512 | lm loss: 2.099035E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.547 | TFLOPs: 57.11 | 7: iteration 13920/ 44073 | consumed samples: 7127040 | consumed tokens: 14596177920 | elapsed time per iteration (s): 4.16 | learning rate: 
1.608E-04 | global batch size: 512 | lm loss: 2.123367E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.182 | TFLOPs: 57.41 | 7: iteration 13930/ 44073 | consumed samples: 7132160 | consumed tokens: 14606663680 | elapsed time per iteration (s): 4.15 | learning rate: 1.608E-04 | global batch size: 512 | lm loss: 2.132955E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.453 | TFLOPs: 57.54 | 7: iteration 13940/ 44073 | consumed samples: 7137280 | consumed tokens: 14617149440 | elapsed time per iteration (s): 4.15 | learning rate: 1.607E-04 | global batch size: 512 | lm loss: 2.116242E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.249 | TFLOPs: 57.44 | 7: iteration 13950/ 44073 | consumed samples: 7142400 | consumed tokens: 14627635200 | elapsed time per iteration (s): 4.15 | learning rate: 1.607E-04 | global batch size: 512 | lm loss: 2.079663E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.482 | TFLOPs: 57.55 | 7: iteration 13960/ 44073 | consumed samples: 7147520 | consumed tokens: 14638120960 | elapsed time per iteration (s): 4.15 | learning rate: 1.606E-04 | global batch size: 512 | lm loss: 2.098298E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.484 | TFLOPs: 57.55 | 7: iteration 13970/ 44073 | consumed samples: 7152640 | consumed tokens: 14648606720 | elapsed time per iteration (s): 4.16 | learning rate: 1.606E-04 | global batch size: 512 | lm loss: 2.128051E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.145 | TFLOPs: 57.39 | 7: iteration 13980/ 44073 | consumed samples: 
7157760 | consumed tokens: 14659092480 | elapsed time per iteration (s): 4.16 | learning rate: 1.605E-04 | global batch size: 512 | lm loss: 2.108749E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.107 | TFLOPs: 57.37 | 7: iteration 13990/ 44073 | consumed samples: 7162880 | consumed tokens: 14669578240 | elapsed time per iteration (s): 4.18 | learning rate: 1.605E-04 | global batch size: 512 | lm loss: 2.112614E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.624 | TFLOPs: 57.15 | 0: [2022-11-26 02:48:42,377] [INFO] [logging.py:68:log_dist] [Rank 0] step=14000, skipped=0, lr=[0.00016041012128345587, 0.00016041012128345587, 0.00016041012128345587], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 14000/ 44073 | consumed samples: 7168000 | consumed tokens: 14680064000 | elapsed time per iteration (s): 4.16 | learning rate: 1.604E-04 | global batch size: 512 | lm loss: 2.105220E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.005 | TFLOPs: 57.33 | 0: steps: 14000 loss: 2.1724 iter time (s): 4.232 samples/sec: 120.969 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 14000 | lm loss value: 2.101228E+00 | lm loss PPL: 8.176208E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 14000 to checkpoints_2b2 0: [2022-11-26 02:48:43,789] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step14000 is begin to save! 0: [2022-11-26 02:48:43,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 02:48:44,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_01-model_00-model_states.pt. 0: [2022-11-26 02:48:44,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:48:44,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:48:44,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:48:44,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:48:44,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:48:44,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:48:44,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:48:44,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:48:44,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:48:45,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_07-model_00-model_states.pt. 0: [2022-11-26 02:48:45,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 02:48:45,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:48:45,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:48:45,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:48:45,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:48:45,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:48:45,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:48:45,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:48:45,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:48:45,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:48:45,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:48:45,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_13-model_00-model_states.pt. 0: [2022-11-26 02:48:45,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 02:48:46,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:48:46,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:48:46,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:48:46,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:48:46,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:48:46,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:48:46,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:48:46,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:48:46,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:48:46,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:48:46,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_19-model_00-model_states.pt. 0: [2022-11-26 02:48:46,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 02:48:46,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:48:46,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:48:47,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:48:47,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:48:47,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:48:47,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:48:47,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:48:47,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:48:47,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:48:47,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:48:47,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_25-model_00-model_states.pt. 0: [2022-11-26 02:48:47,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 02:48:47,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:48:47,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:48:47,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:48:47,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:48:48,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:48:48,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_29-model_00-model_states.pt... 0: [2022-11-26 02:48:48,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_29-model_00-model_states.pt. 0: [2022-11-26 02:48:48,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:48:48,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:48:48,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_31-model_00-model_states.pt... 0: [2022-11-26 02:48:48,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_31-model_00-model_states.pt. 0: [2022-11-26 02:48:48,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 02:48:48,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_32-model_00-model_states.pt. 0: [2022-11-26 02:48:48,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_33-model_00-model_states.pt... 0: [2022-11-26 02:48:48,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_33-model_00-model_states.pt. 0: [2022-11-26 02:48:48,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_34-model_00-model_states.pt... 0: [2022-11-26 02:48:48,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_34-model_00-model_states.pt. 0: [2022-11-26 02:48:48,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/layer_36-model_00-model_states.pt... 0: [2022-11-26 02:48:48,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/layer_36-model_00-model_states.pt. 0: [2022-11-26 02:48:48,881] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step14000/mp_rank_00_model_states.pt 0: [2022-11-26 02:48:48,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:48:48,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:48:49,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
5: [2022-11-26 02:48:49,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
1: [2022-11-26 02:48:49,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-26 02:48:49,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
5: [2022-11-26 02:48:49,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:49,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:48:49,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:49,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
1: [2022-11-26 02:48:49,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:48:49,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-26 02:48:49,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
1: [2022-11-26 02:48:49,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
2: [2022-11-26 02:48:49,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-26 02:48:49,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:48:49,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:48:49,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-26 02:48:49,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-26 02:48:49,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
1: [2022-11-26 02:48:49,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:48:49,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:48:49,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-26 02:48:49,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:48:49,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:48:49,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
4: [2022-11-26 02:48:49,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-26 02:48:49,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:49,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:49,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
3: [2022-11-26 02:48:49,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:49,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
6: [2022-11-26 02:48:49,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-26 02:48:49,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:49,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:49,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:49,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-26 02:48:50,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:48:50,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:50,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:48:50,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:50,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:50,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:50,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:50,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:50,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:50,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:50,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-26 02:48:50,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:48:50,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:48:50,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:48:50,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:48:50,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-26 02:48:50,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
0: successfully saved checkpoint at iteration 14000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6495.29
7: iteration 14010/ 44073 | consumed samples: 7173120 | consumed tokens: 14690549760 | elapsed time per iteration (s): 5.02 | learning rate: 1.604E-04 | global batch size: 512 | lm loss: 2.102416E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 101.899 | TFLOPs: 47.49 |
7: iteration 14020/ 44073 | consumed samples: 7178240 | consumed tokens: 14701035520 | elapsed time per iteration (s): 4.17 | learning rate: 1.603E-04 | global batch size: 512 | lm loss: 2.115271E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.822 | TFLOPs: 57.24 |
7: iteration 14030/ 44073 | consumed samples: 7183360 | consumed tokens: 14711521280 | elapsed time per iteration (s): 4.15 | learning rate: 1.602E-04 | global batch size: 512 | lm loss: 2.091968E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.247 | TFLOPs: 57.44 |
7: iteration 14040/ 44073 | consumed samples: 7188480 | consumed tokens: 14722007040 | elapsed time per iteration (s): 4.23 | learning rate: 1.602E-04 | global batch size: 512 | lm loss: 2.120389E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.155 | TFLOPs: 56.46 |
7: iteration 14050/ 44073 | consumed samples: 7193600 | consumed tokens: 14732492800 | elapsed time per iteration (s): 4.17 | learning rate: 1.601E-04 | global batch size: 512 | lm loss: 2.115360E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.849 | TFLOPs: 57.25 |
7: iteration 14060/ 44073 | consumed samples: 7198720 | consumed tokens: 14742978560 | elapsed time per iteration (s): 4.21 | learning rate: 1.601E-04 | global batch size: 512 | lm loss: 2.107133E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.709 | TFLOPs: 56.72 |
7: iteration 14070/ 44073 | consumed samples: 7203840 | consumed tokens: 14753464320 | elapsed time per iteration (s): 4.18 | learning rate: 1.600E-04 | global batch size: 512 | lm loss: 2.127243E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.535 | TFLOPs: 57.11 |
7: iteration 14080/ 44073 | consumed samples: 7208960 | consumed tokens: 14763950080 | elapsed time per iteration (s): 4.15 | learning rate: 1.600E-04 | global batch size: 512 | lm loss: 2.100721E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.337 | TFLOPs: 57.48 |
7: iteration 14090/ 44073 | consumed samples: 7214080 | consumed tokens: 14774435840 | elapsed time per iteration (s): 4.17 | learning rate: 1.599E-04 | global batch size: 512 | lm loss: 2.094684E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.882 | TFLOPs: 57.27 |
7: iteration 14100/ 44073 | consumed samples: 7219200 | consumed tokens: 14784921600 | elapsed time per iteration (s): 4.23 | learning rate: 1.599E-04 | global batch size: 512 | lm loss: 2.093608E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.061 | TFLOPs: 56.42 |
7: iteration 14110/ 44073 | consumed samples: 7224320 | consumed tokens: 14795407360 | elapsed time per iteration (s): 4.19 | learning rate: 1.598E-04 | global batch size: 512 | lm loss: 2.108561E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.241 | TFLOPs: 56.97 |
7: iteration 14120/ 44073 | consumed samples: 7229440 | consumed tokens: 14805893120 | elapsed time per iteration (s): 4.18 | learning rate: 1.598E-04 | global batch size: 512 | lm loss: 2.096351E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.408 | TFLOPs: 57.05 |
7: iteration 14130/ 44073 | consumed samples: 7234560 | consumed tokens: 14816378880 | elapsed time per iteration (s): 4.19 | learning rate: 1.597E-04 | global batch size: 512 | lm loss: 2.102662E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.103 | TFLOPs: 56.91 |
7: iteration 14140/ 44073 | consumed samples: 7239680 | consumed tokens: 14826864640 | elapsed time per iteration (s): 4.17 | learning rate: 1.597E-04 | global batch size: 512 | lm loss: 2.100929E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.777 | TFLOPs: 57.22 |
7: iteration 14150/ 44073 | consumed samples: 7244800 | consumed tokens: 14837350400 | elapsed time per iteration (s): 4.18 | learning rate: 1.596E-04 | global batch size: 512 | lm loss: 2.109864E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.369 | TFLOPs: 57.03 |
7: iteration 14160/ 44073 | consumed samples: 7249920 | consumed tokens: 14847836160 | elapsed time per iteration (s): 4.22 | learning rate: 1.595E-04 | global batch size: 512 | lm loss: 2.145725E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.412 | TFLOPs: 56.58 |
7: iteration 14170/ 44073 | consumed samples: 7255040 | consumed tokens: 14858321920 | elapsed time per iteration (s): 4.18 | learning rate: 1.595E-04 | global batch size: 512 | lm loss: 2.122214E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.551 | TFLOPs: 57.12 |
7: iteration 14180/ 44073 | consumed samples: 7260160 | consumed tokens: 14868807680 | elapsed time per iteration (s): 4.24 | learning rate: 1.594E-04 | global batch size: 512 | lm loss: 2.106766E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.748 | TFLOPs: 56.27 |
7: iteration 14190/ 44073 | consumed samples: 7265280 | consumed tokens: 14879293440 | elapsed time per iteration (s): 4.19 | learning rate: 1.594E-04 | global batch size: 512 | lm loss: 2.105994E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.304 | TFLOPs: 57.00 |
7: iteration 14200/ 44073 | consumed samples: 7270400 | consumed tokens: 14889779200 | elapsed time per iteration (s): 4.16 | learning rate: 1.593E-04 | global batch size: 512 | lm loss: 2.112916E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.960 | TFLOPs: 57.31 |
7: iteration 14210/ 44073 | consumed samples: 7275520 | consumed tokens: 14900264960 | elapsed time per iteration (s): 4.15 | learning rate: 1.593E-04 | global batch size: 512 | lm loss: 2.118739E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.325 | TFLOPs: 57.48 |
7: iteration 14220/ 44073 | consumed samples: 7280640 | consumed tokens: 14910750720 | elapsed time per iteration (s): 4.15 | learning rate: 1.592E-04 | global batch size: 512 | lm loss: 2.089349E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.228 | TFLOPs: 57.43 |
7: iteration 14230/ 44073 | consumed samples: 7285760 | consumed tokens: 14921236480 | elapsed time per iteration (s): 4.15 | learning rate: 1.592E-04 | global batch size: 512 | lm loss: 2.091933E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.370 | TFLOPs: 57.50 |
7: iteration 14240/ 44073 | consumed samples: 7290880 | consumed tokens: 14931722240 | elapsed time per iteration (s): 4.16 | learning rate: 1.591E-04 | global batch size: 512 | lm loss: 2.091706E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.965 | TFLOPs: 57.31 |
7: iteration 14250/ 44073 | consumed samples: 7296000 | consumed tokens: 14942208000 | elapsed time per iteration (s): 4.18 | learning rate: 1.591E-04 | global batch size: 512 | lm loss: 2.116011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.382 | TFLOPs: 57.04 |
7: iteration 14260/ 44073 | consumed samples: 7301120 | consumed tokens: 14952693760 | elapsed time per iteration (s): 4.14 | learning rate: 1.590E-04 | global batch size: 512 | lm loss: 2.119822E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.730 | TFLOPs: 57.66 |
7: iteration 14270/ 44073 | consumed samples: 7306240 | consumed tokens: 14963179520 | elapsed time per iteration (s): 4.20 | learning rate: 1.590E-04 | global batch size: 512 | lm loss: 2.102969E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.836 | TFLOPs: 56.78 |
7: iteration 14280/ 44073 | consumed samples: 7311360 | consumed tokens: 14973665280 | elapsed time per iteration (s): 4.17 | learning rate: 1.589E-04 | global batch size: 512 | lm loss: 2.114957E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.718 | TFLOPs: 57.19 |
7: iteration 14290/ 44073 | consumed samples: 7316480 | consumed tokens: 14984151040 | elapsed time per iteration (s): 4.16
| learning rate: 1.588E-04 | global batch size: 512 | lm loss: 2.101647E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.216 | TFLOPs: 57.42 | 7: iteration 14300/ 44073 | consumed samples: 7321600 | consumed tokens: 14994636800 | elapsed time per iteration (s): 4.20 | learning rate: 1.588E-04 | global batch size: 512 | lm loss: 2.112065E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.981 | TFLOPs: 56.85 | 7: iteration 14310/ 44073 | consumed samples: 7326720 | consumed tokens: 15005122560 | elapsed time per iteration (s): 4.16 | learning rate: 1.587E-04 | global batch size: 512 | lm loss: 2.105657E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.957 | TFLOPs: 57.30 | 7: iteration 14320/ 44073 | consumed samples: 7331840 | consumed tokens: 15015608320 | elapsed time per iteration (s): 4.18 | learning rate: 1.587E-04 | global batch size: 512 | lm loss: 2.112788E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.502 | TFLOPs: 57.09 | 7: iteration 14330/ 44073 | consumed samples: 7336960 | consumed tokens: 15026094080 | elapsed time per iteration (s): 4.17 | learning rate: 1.586E-04 | global batch size: 512 | lm loss: 2.106869E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.894 | TFLOPs: 57.27 | 7: iteration 14340/ 44073 | consumed samples: 7342080 | consumed tokens: 15036579840 | elapsed time per iteration (s): 4.15 | learning rate: 1.586E-04 | global batch size: 512 | lm loss: 2.109765E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.448 | TFLOPs: 57.53 | 7: iteration 14350/ 44073 | 
consumed samples: 7347200 | consumed tokens: 15047065600 | elapsed time per iteration (s): 4.16 | learning rate: 1.585E-04 | global batch size: 512 | lm loss: 2.107785E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.067 | TFLOPs: 57.36 | 7: iteration 14360/ 44073 | consumed samples: 7352320 | consumed tokens: 15057551360 | elapsed time per iteration (s): 5.37 | learning rate: 1.585E-04 | global batch size: 512 | lm loss: 2.104749E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 95.395 | TFLOPs: 44.46 | 7: iteration 14370/ 44073 | consumed samples: 7357440 | consumed tokens: 15068037120 | elapsed time per iteration (s): 4.17 | learning rate: 1.584E-04 | global batch size: 512 | lm loss: 2.129100E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.696 | TFLOPs: 57.18 | 7: iteration 14380/ 44073 | consumed samples: 7362560 | consumed tokens: 15078522880 | elapsed time per iteration (s): 4.20 | learning rate: 1.584E-04 | global batch size: 512 | lm loss: 2.119821E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.038 | TFLOPs: 56.88 | 7: iteration 14390/ 44073 | consumed samples: 7367680 | consumed tokens: 15089008640 | elapsed time per iteration (s): 4.16 | learning rate: 1.583E-04 | global batch size: 512 | lm loss: 2.110661E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.141 | TFLOPs: 57.39 | 7: iteration 14400/ 44073 | consumed samples: 7372800 | consumed tokens: 15099494400 | elapsed time per iteration (s): 4.15 | learning rate: 1.582E-04 | global batch size: 512 | lm loss: 2.106550E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.238 | TFLOPs: 57.43 | 7: iteration 14410/ 44073 | consumed samples: 7377920 | consumed tokens: 15109980160 | elapsed time per iteration (s): 4.17 | learning rate: 1.582E-04 | global batch size: 512 | lm loss: 2.090240E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.816 | TFLOPs: 57.24 | 7: iteration 14420/ 44073 | consumed samples: 7383040 | consumed tokens: 15120465920 | elapsed time per iteration (s): 4.15 | learning rate: 1.581E-04 | global batch size: 512 | lm loss: 2.100039E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.264 | TFLOPs: 57.45 | 7: iteration 14430/ 44073 | consumed samples: 7388160 | consumed tokens: 15130951680 | elapsed time per iteration (s): 4.15 | learning rate: 1.581E-04 | global batch size: 512 | lm loss: 2.108733E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.232 | TFLOPs: 57.43 | 7: iteration 14440/ 44073 | consumed samples: 7393280 | consumed tokens: 15141437440 | elapsed time per iteration (s): 4.15 | learning rate: 1.580E-04 | global batch size: 512 | lm loss: 2.106343E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.402 | TFLOPs: 57.51 | 7: iteration 14450/ 44073 | consumed samples: 7398400 | consumed tokens: 15151923200 | elapsed time per iteration (s): 4.76 | learning rate: 1.580E-04 | global batch size: 512 | lm loss: 2.110620E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 107.514 | TFLOPs: 50.11 | 7: iteration 14460/ 44073 | consumed samples: 7403520 | consumed tokens: 15162408960 | elapsed time per iteration (s): 4.16 | learning rate: 1.579E-04 | global batch size: 512 | lm loss: 
2.117722E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.060 | TFLOPs: 57.35 | 7: iteration 14470/ 44073 | consumed samples: 7408640 | consumed tokens: 15172894720 | elapsed time per iteration (s): 4.16 | learning rate: 1.579E-04 | global batch size: 512 | lm loss: 2.114549E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.056 | TFLOPs: 57.35 | 7: iteration 14480/ 44073 | consumed samples: 7413760 | consumed tokens: 15183380480 | elapsed time per iteration (s): 4.19 | learning rate: 1.578E-04 | global batch size: 512 | lm loss: 2.101126E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.109 | TFLOPs: 56.91 | 7: iteration 14490/ 44073 | consumed samples: 7418880 | consumed tokens: 15193866240 | elapsed time per iteration (s): 4.13 | learning rate: 1.577E-04 | global batch size: 512 | lm loss: 2.136470E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.834 | TFLOPs: 57.71 | 7: iteration 14500/ 44073 | consumed samples: 7424000 | consumed tokens: 15204352000 | elapsed time per iteration (s): 4.15 | learning rate: 1.577E-04 | global batch size: 512 | lm loss: 2.100722E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.429 | TFLOPs: 57.52 | 7: iteration 14510/ 44073 | consumed samples: 7429120 | consumed tokens: 15214837760 | elapsed time per iteration (s): 4.14 | learning rate: 1.576E-04 | global batch size: 512 | lm loss: 2.088902E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.526 | TFLOPs: 57.57 | 7: iteration 14520/ 44073 | consumed samples: 7434240 | consumed tokens: 15225323520 | elapsed 
time per iteration (s): 4.15 | learning rate: 1.576E-04 | global batch size: 512 | lm loss: 2.093712E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 | 7: iteration 14530/ 44073 | consumed samples: 7439360 | consumed tokens: 15235809280 | elapsed time per iteration (s): 4.14 | learning rate: 1.575E-04 | global batch size: 512 | lm loss: 2.113321E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 14540/ 44073 | consumed samples: 7444480 | consumed tokens: 15246295040 | elapsed time per iteration (s): 4.15 | learning rate: 1.575E-04 | global batch size: 512 | lm loss: 2.106418E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.392 | TFLOPs: 57.51 | 7: iteration 14550/ 44073 | consumed samples: 7449600 | consumed tokens: 15256780800 | elapsed time per iteration (s): 4.19 | learning rate: 1.574E-04 | global batch size: 512 | lm loss: 2.093078E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.133 | TFLOPs: 56.92 | 7: iteration 14560/ 44073 | consumed samples: 7454720 | consumed tokens: 15267266560 | elapsed time per iteration (s): 4.17 | learning rate: 1.574E-04 | global batch size: 512 | lm loss: 2.095675E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.920 | TFLOPs: 57.29 | 7: iteration 14570/ 44073 | consumed samples: 7459840 | consumed tokens: 15277752320 | elapsed time per iteration (s): 4.15 | learning rate: 1.573E-04 | global batch size: 512 | lm loss: 2.096815E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.360 | TFLOPs: 57.49 | 7: 
iteration 14580/ 44073 | consumed samples: 7464960 | consumed tokens: 15288238080 | elapsed time per iteration (s): 4.15 | learning rate: 1.573E-04 | global batch size: 512 | lm loss: 2.106208E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.331 | TFLOPs: 57.48 | 7: iteration 14590/ 44073 | consumed samples: 7470080 | consumed tokens: 15298723840 | elapsed time per iteration (s): 4.17 | learning rate: 1.572E-04 | global batch size: 512 | lm loss: 2.101421E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.899 | TFLOPs: 57.28 | 7: iteration 14600/ 44073 | consumed samples: 7475200 | consumed tokens: 15309209600 | elapsed time per iteration (s): 4.17 | learning rate: 1.571E-04 | global batch size: 512 | lm loss: 2.080145E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.786 | TFLOPs: 57.22 | 7: iteration 14610/ 44073 | consumed samples: 7480320 | consumed tokens: 15319695360 | elapsed time per iteration (s): 4.19 | learning rate: 1.571E-04 | global batch size: 512 | lm loss: 2.113181E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.332 | TFLOPs: 57.01 | 7: iteration 14620/ 44073 | consumed samples: 7485440 | consumed tokens: 15330181120 | elapsed time per iteration (s): 4.15 | learning rate: 1.570E-04 | global batch size: 512 | lm loss: 2.086432E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.243 | TFLOPs: 57.44 | 7: iteration 14630/ 44073 | consumed samples: 7490560 | consumed tokens: 15340666880 | elapsed time per iteration (s): 4.16 | learning rate: 1.570E-04 | global batch size: 512 | lm loss: 2.084621E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.964 | TFLOPs: 57.31 | 7: iteration 14640/ 44073 | consumed samples: 7495680 | consumed tokens: 15351152640 | elapsed time per iteration (s): 4.19 | learning rate: 1.569E-04 | global batch size: 512 | lm loss: 2.106980E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.286 | TFLOPs: 56.99 | 7: iteration 14650/ 44073 | consumed samples: 7500800 | consumed tokens: 15361638400 | elapsed time per iteration (s): 4.16 | learning rate: 1.569E-04 | global batch size: 512 | lm loss: 2.100718E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.150 | TFLOPs: 57.39 | 7: iteration 14660/ 44073 | consumed samples: 7505920 | consumed tokens: 15372124160 | elapsed time per iteration (s): 4.16 | learning rate: 1.568E-04 | global batch size: 512 | lm loss: 2.099895E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.086 | TFLOPs: 57.36 | 7: iteration 14670/ 44073 | consumed samples: 7511040 | consumed tokens: 15382609920 | elapsed time per iteration (s): 4.16 | learning rate: 1.568E-04 | global batch size: 512 | lm loss: 2.086511E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.053 | TFLOPs: 57.35 | 7: iteration 14680/ 44073 | consumed samples: 7516160 | consumed tokens: 15393095680 | elapsed time per iteration (s): 4.16 | learning rate: 1.567E-04 | global batch size: 512 | lm loss: 2.099433E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.141 | TFLOPs: 57.39 | 7: iteration 14690/ 44073 | consumed samples: 7521280 | consumed tokens: 15403581440 | elapsed time per iteration (s): 4.22 | learning rate: 1.566E-04 | global batch 
size: 512 | lm loss: 2.196889E+00 | grad norm: 2.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.247 | TFLOPs: 56.51 | 7: iteration 14700/ 44073 | consumed samples: 7526400 | consumed tokens: 15414067200 | elapsed time per iteration (s): 4.18 | learning rate: 1.566E-04 | global batch size: 512 | lm loss: 2.419072E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.575 | TFLOPs: 57.13 | 7: iteration 14710/ 44073 | consumed samples: 7531520 | consumed tokens: 15424552960 | elapsed time per iteration (s): 4.17 | learning rate: 1.565E-04 | global batch size: 512 | lm loss: 2.210168E+00 | grad norm: 0.312 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.889 | TFLOPs: 57.27 | 7: iteration 14720/ 44073 | consumed samples: 7536640 | consumed tokens: 15435038720 | elapsed time per iteration (s): 4.16 | learning rate: 1.565E-04 | global batch size: 512 | lm loss: 2.180769E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.989 | TFLOPs: 57.32 | 7: iteration 14730/ 44073 | consumed samples: 7541760 | consumed tokens: 15445524480 | elapsed time per iteration (s): 4.14 | learning rate: 1.564E-04 | global batch size: 512 | lm loss: 2.150791E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.788 | TFLOPs: 57.69 | 7: iteration 14740/ 44073 | consumed samples: 7546880 | consumed tokens: 15456010240 | elapsed time per iteration (s): 4.15 | learning rate: 1.564E-04 | global batch size: 512 | lm loss: 2.142009E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.501 | TFLOPs: 57.56 | 7: iteration 14750/ 44073 | consumed samples: 7552000 | consumed tokens: 
15466496000 | elapsed time per iteration (s): 4.16 | learning rate: 1.563E-04 | global batch size: 512 | lm loss: 2.125580E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.113 | TFLOPs: 57.38 | 7: iteration 14760/ 44073 | consumed samples: 7557120 | consumed tokens: 15476981760 | elapsed time per iteration (s): 4.15 | learning rate: 1.563E-04 | global batch size: 512 | lm loss: 2.130080E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 14770/ 44073 | consumed samples: 7562240 | consumed tokens: 15487467520 | elapsed time per iteration (s): 4.18 | learning rate: 1.562E-04 | global batch size: 512 | lm loss: 2.113545E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.615 | TFLOPs: 57.14 | 7: iteration 14780/ 44073 | consumed samples: 7567360 | consumed tokens: 15497953280 | elapsed time per iteration (s): 4.17 | learning rate: 1.561E-04 | global batch size: 512 | lm loss: 2.097757E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.841 | TFLOPs: 57.25 | 7: iteration 14790/ 44073 | consumed samples: 7572480 | consumed tokens: 15508439040 | elapsed time per iteration (s): 4.15 | learning rate: 1.561E-04 | global batch size: 512 | lm loss: 2.111228E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.506 | TFLOPs: 57.56 | 7: iteration 14800/ 44073 | consumed samples: 7577600 | consumed tokens: 15518924800 | elapsed time per iteration (s): 4.18 | learning rate: 1.560E-04 | global batch size: 512 | lm loss: 2.114899E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.431 | 
TFLOPs: 57.06 | 7: iteration 14810/ 44073 | consumed samples: 7582720 | consumed tokens: 15529410560 | elapsed time per iteration (s): 4.18 | learning rate: 1.560E-04 | global batch size: 512 | lm loss: 2.101257E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.348 | TFLOPs: 57.02 | 7: iteration 14820/ 44073 | consumed samples: 7587840 | consumed tokens: 15539896320 | elapsed time per iteration (s): 4.17 | learning rate: 1.559E-04 | global batch size: 512 | lm loss: 2.097361E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.764 | TFLOPs: 57.21 | 7: iteration 14830/ 44073 | consumed samples: 7592960 | consumed tokens: 15550382080 | elapsed time per iteration (s): 4.19 | learning rate: 1.559E-04 | global batch size: 512 | lm loss: 2.099557E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.311 | TFLOPs: 57.00 | 7: iteration 14840/ 44073 | consumed samples: 7598080 | consumed tokens: 15560867840 | elapsed time per iteration (s): 4.19 | learning rate: 1.558E-04 | global batch size: 512 | lm loss: 2.113830E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.157 | TFLOPs: 56.93 | 7: iteration 14850/ 44073 | consumed samples: 7603200 | consumed tokens: 15571353600 | elapsed time per iteration (s): 4.16 | learning rate: 1.558E-04 | global batch size: 512 | lm loss: 2.085984E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.072 | TFLOPs: 57.36 | 7: iteration 14860/ 44073 | consumed samples: 7608320 | consumed tokens: 15581839360 | elapsed time per iteration (s): 4.17 | learning rate: 1.557E-04 | global batch size: 512 | lm loss: 2.099720E+00 | grad norm: 0.130 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.920 | TFLOPs: 57.29 | 7: iteration 14870/ 44073 | consumed samples: 7613440 | consumed tokens: 15592325120 | elapsed time per iteration (s): 4.15 | learning rate: 1.556E-04 | global batch size: 512 | lm loss: 2.114587E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.480 | TFLOPs: 57.55 | 7: iteration 14880/ 44073 | consumed samples: 7618560 | consumed tokens: 15602810880 | elapsed time per iteration (s): 4.19 | learning rate: 1.556E-04 | global batch size: 512 | lm loss: 2.092368E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.085 | TFLOPs: 56.90 | 7: iteration 14890/ 44073 | consumed samples: 7623680 | consumed tokens: 15613296640 | elapsed time per iteration (s): 4.20 | learning rate: 1.555E-04 | global batch size: 512 | lm loss: 2.081325E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.000 | TFLOPs: 56.86 | 7: iteration 14900/ 44073 | consumed samples: 7628800 | consumed tokens: 15623782400 | elapsed time per iteration (s): 4.20 | learning rate: 1.555E-04 | global batch size: 512 | lm loss: 2.116260E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.956 | TFLOPs: 56.84 | 7: iteration 14910/ 44073 | consumed samples: 7633920 | consumed tokens: 15634268160 | elapsed time per iteration (s): 4.19 | learning rate: 1.554E-04 | global batch size: 512 | lm loss: 2.086860E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.291 | TFLOPs: 56.99 | 7: iteration 14920/ 44073 | consumed samples: 7639040 | consumed tokens: 15644753920 | elapsed time per iteration (s): 4.19 | learning rate: 
1.554E-04 | global batch size: 512 | lm loss: 2.095640E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.255 | TFLOPs: 56.98 | 7: iteration 14930/ 44073 | consumed samples: 7644160 | consumed tokens: 15655239680 | elapsed time per iteration (s): 4.14 | learning rate: 1.553E-04 | global batch size: 512 | lm loss: 2.079946E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.561 | TFLOPs: 57.59 | 7: iteration 14940/ 44073 | consumed samples: 7649280 | consumed tokens: 15665725440 | elapsed time per iteration (s): 4.15 | learning rate: 1.553E-04 | global batch size: 512 | lm loss: 2.088846E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.489 | TFLOPs: 57.55 | 7: iteration 14950/ 44073 | consumed samples: 7654400 | consumed tokens: 15676211200 | elapsed time per iteration (s): 4.14 | learning rate: 1.552E-04 | global batch size: 512 | lm loss: 2.106990E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.808 | TFLOPs: 57.70 | 7: iteration 14960/ 44073 | consumed samples: 7659520 | consumed tokens: 15686696960 | elapsed time per iteration (s): 4.15 | learning rate: 1.551E-04 | global batch size: 512 | lm loss: 2.096923E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 | 7: iteration 14970/ 44073 | consumed samples: 7664640 | consumed tokens: 15697182720 | elapsed time per iteration (s): 4.14 | learning rate: 1.551E-04 | global batch size: 512 | lm loss: 2.080641E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.679 | TFLOPs: 57.64 | 7: iteration 14980/ 44073 | consumed samples: 
7669760 | consumed tokens: 15707668480 | elapsed time per iteration (s): 4.14 | learning rate: 1.550E-04 | global batch size: 512 | lm loss: 2.099760E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.659 | TFLOPs: 57.63 | 7: iteration 14990/ 44073 | consumed samples: 7674880 | consumed tokens: 15718154240 | elapsed time per iteration (s): 4.17 | learning rate: 1.550E-04 | global batch size: 512 | lm loss: 2.085583E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.780 | TFLOPs: 57.22 | 7: iteration 15000/ 44073 | consumed samples: 7680000 | consumed tokens: 15728640000 | elapsed time per iteration (s): 4.16 | learning rate: 1.549E-04 | global batch size: 512 | lm loss: 2.107909E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.146 | TFLOPs: 57.39 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 15000 | lm loss value: 2.150798E+00 | lm loss PPL: 8.591713E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 15000 to checkpoints_2b2 0: [2022-11-26 03:58:39,178] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step15000 is begin to save! 0: [2022-11-26 03:58:39,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:58:39,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:58:39,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 03:58:39,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:58:39,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:58:39,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:58:39,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:58:39,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:58:39,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:58:40,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:58:40,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:58:40,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:58:40,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:58:40,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:58:40,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 03:58:40,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:58:40,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:58:40,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:58:40,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:58:40,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:58:40,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:58:40,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:58:40,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:58:40,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:58:40,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:58:41,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:58:41,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 03:58:41,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:58:41,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:58:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:58:41,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:58:41,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:58:41,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:58:41,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:58:41,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:58:41,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:58:41,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:58:41,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:58:41,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 03:58:41,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:58:41,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:58:42,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:58:42,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:58:42,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:58:42,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:58:42,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:58:42,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:58:42,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:58:42,405] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:58:42,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:58:42,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 03:58:42,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:58:42,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:58:42,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:58:42,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_29-model_00-model_states.pt... 0: [2022-11-26 03:58:42,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_29-model_00-model_states.pt. 0: [2022-11-26 03:58:42,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:58:43,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:58:43,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_31-model_00-model_states.pt... 0: [2022-11-26 03:58:43,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_31-model_00-model_states.pt. 0: [2022-11-26 03:58:43,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_32-model_00-model_states.pt... 0: [2022-11-26 03:58:43,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_32-model_00-model_states.pt. 0: [2022-11-26 03:58:43,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 03:58:43,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_33-model_00-model_states.pt. 0: [2022-11-26 03:58:43,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_34-model_00-model_states.pt... 0: [2022-11-26 03:58:43,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_34-model_00-model_states.pt. 0: [2022-11-26 03:58:43,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/layer_36-model_00-model_states.pt... 0: [2022-11-26 03:58:43,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/layer_36-model_00-model_states.pt. 0: [2022-11-26 03:58:43,516] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step15000/mp_rank_00_model_states.pt 0: [2022-11-26 03:58:43,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:58:43,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:43,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:58:44,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 03:58:44,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 03:58:44,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:58:44,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 03:58:44,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
0: [2022-11-26 03:58:44,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
2: [2022-11-26 03:58:44,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 03:58:44,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 03:58:44,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:58:44,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
1: [2022-11-26 03:58:44,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
1: [2022-11-26 03:58:44,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:58:44,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:58:44,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 03:58:44,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:58:44,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:58:44,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 03:58:44,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:58:44,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:58:44,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:58:44,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:58:44,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:58:44,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 03:58:44,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:58:44,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:58:44,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 03:58:44,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:58:44,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:58:44,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:58:44,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:58:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 03:58:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 03:58:44,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:58:44,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 03:58:44,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:58:44,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:58:44,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
0: [2022-11-26 03:58:44,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:58:44,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: successfully saved checkpoint at iteration 15000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5642.06 7: iteration 15010/ 44073 | consumed samples: 7685120 | consumed tokens: 15739125760 | elapsed time per iteration (s): 4.86 | learning rate: 1.549E-04 | global batch size: 512 | lm loss: 2.098916E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.243 | TFLOPs: 49.05 | 7: iteration 15020/ 44073 | consumed samples: 7690240 | consumed tokens: 15749611520 | elapsed time per iteration (s): 4.16 | learning rate: 1.548E-04 | global batch size: 512 | lm loss: 2.097154E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.113 | TFLOPs: 57.38 | 7: iteration 15030/ 44073 | consumed samples: 7695360 | consumed tokens: 15760097280 | elapsed time per iteration (s): 4.16 | learning rate: 1.547E-04 | global batch size: 512 | lm loss: 2.107091E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.951 | TFLOPs: 57.30 | 7: iteration 15040/ 44073 | consumed samples: 7700480 | consumed tokens: 15770583040 | elapsed time per iteration (s): 4.16 | learning rate: 1.547E-04 | global batch size: 512 | lm loss: 2.106018E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.930 | TFLOPs: 57.29 | 7: iteration 15050/ 44073 | consumed samples: 7705600 | consumed tokens: 15781068800 | elapsed time per iteration (s): 4.15 | learning rate: 1.546E-04 | global batch size: 512 | lm loss: 
2.083692E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.237 | TFLOPs: 57.43 | 7: iteration 15060/ 44073 | consumed samples: 7710720 | consumed tokens: 15791554560 | elapsed time per iteration (s): 4.18 | learning rate: 1.546E-04 | global batch size: 512 | lm loss: 2.096817E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.602 | TFLOPs: 57.14 | 7: iteration 15070/ 44073 | consumed samples: 7715840 | consumed tokens: 15802040320 | elapsed time per iteration (s): 4.19 | learning rate: 1.545E-04 | global batch size: 512 | lm loss: 2.098241E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.182 | TFLOPs: 56.94 | 7: iteration 15080/ 44073 | consumed samples: 7720960 | consumed tokens: 15812526080 | elapsed time per iteration (s): 4.18 | learning rate: 1.545E-04 | global batch size: 512 | lm loss: 2.102127E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.565 | TFLOPs: 57.12 | 7: iteration 15090/ 44073 | consumed samples: 7726080 | consumed tokens: 15823011840 | elapsed time per iteration (s): 4.27 | learning rate: 1.544E-04 | global batch size: 512 | lm loss: 2.095070E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.998 | TFLOPs: 55.93 | 7: iteration 15100/ 44073 | consumed samples: 7731200 | consumed tokens: 15833497600 | elapsed time per iteration (s): 4.18 | learning rate: 1.544E-04 | global batch size: 512 | lm loss: 2.092828E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.363 | TFLOPs: 57.03 | 7: iteration 15110/ 44073 | consumed samples: 7736320 | consumed tokens: 15843983360 | elapsed 
time per iteration (s): 4.14 | learning rate: 1.543E-04 | global batch size: 512 | lm loss: 2.084817E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.686 | TFLOPs: 57.64 | 7: iteration 15120/ 44073 | consumed samples: 7741440 | consumed tokens: 15854469120 | elapsed time per iteration (s): 4.13 | learning rate: 1.542E-04 | global batch size: 512 | lm loss: 2.079173E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.824 | TFLOPs: 57.71 | 7: iteration 15130/ 44073 | consumed samples: 7746560 | consumed tokens: 15864954880 | elapsed time per iteration (s): 4.14 | learning rate: 1.542E-04 | global batch size: 512 | lm loss: 2.119132E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.558 | TFLOPs: 57.58 | 7: iteration 15140/ 44073 | consumed samples: 7751680 | consumed tokens: 15875440640 | elapsed time per iteration (s): 4.14 | learning rate: 1.541E-04 | global batch size: 512 | lm loss: 2.105765E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.584 | TFLOPs: 57.60 | 7: iteration 15150/ 44073 | consumed samples: 7756800 | consumed tokens: 15885926400 | elapsed time per iteration (s): 4.14 | learning rate: 1.541E-04 | global batch size: 512 | lm loss: 2.094680E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.580 | TFLOPs: 57.59 | 7: iteration 15160/ 44073 | consumed samples: 7761920 | consumed tokens: 15896412160 | elapsed time per iteration (s): 4.18 | learning rate: 1.540E-04 | global batch size: 512 | lm loss: 2.092803E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.398 | TFLOPs: 57.04 | 7: 
iteration 15170/ 44073 | consumed samples: 7767040 | consumed tokens: 15906897920 | elapsed time per iteration (s): 4.15 | learning rate: 1.540E-04 | global batch size: 512 | lm loss: 2.092118E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.421 | TFLOPs: 57.52 | 7: iteration 15180/ 44073 | consumed samples: 7772160 | consumed tokens: 15917383680 | elapsed time per iteration (s): 4.15 | learning rate: 1.539E-04 | global batch size: 512 | lm loss: 2.092018E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.254 | TFLOPs: 57.44 | 7: iteration 15190/ 44073 | consumed samples: 7777280 | consumed tokens: 15927869440 | elapsed time per iteration (s): 4.13 | learning rate: 1.538E-04 | global batch size: 512 | lm loss: 2.103202E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.872 | TFLOPs: 57.73 | 7: iteration 15200/ 44073 | consumed samples: 7782400 | consumed tokens: 15938355200 | elapsed time per iteration (s): 4.13 | learning rate: 1.538E-04 | global batch size: 512 | lm loss: 2.109587E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.883 | TFLOPs: 57.74 | 7: iteration 15210/ 44073 | consumed samples: 7787520 | consumed tokens: 15948840960 | elapsed time per iteration (s): 4.14 | learning rate: 1.537E-04 | global batch size: 512 | lm loss: 2.080280E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.766 | TFLOPs: 57.68 | 7: iteration 15220/ 44073 | consumed samples: 7792640 | consumed tokens: 15959326720 | elapsed time per iteration (s): 4.14 | learning rate: 1.537E-04 | global batch size: 512 | lm loss: 2.094109E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.776 | TFLOPs: 57.69 | 7: iteration 15230/ 44073 | consumed samples: 7797760 | consumed tokens: 15969812480 | elapsed time per iteration (s): 4.14 | learning rate: 1.536E-04 | global batch size: 512 | lm loss: 2.073018E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.740 | TFLOPs: 57.67 | 7: iteration 15240/ 44073 | consumed samples: 7802880 | consumed tokens: 15980298240 | elapsed time per iteration (s): 4.26 | learning rate: 1.536E-04 | global batch size: 512 | lm loss: 2.094477E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.233 | TFLOPs: 56.03 | 7: iteration 15250/ 44073 | consumed samples: 7808000 | consumed tokens: 15990784000 | elapsed time per iteration (s): 4.14 | learning rate: 1.535E-04 | global batch size: 512 | lm loss: 2.088433E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.674 | TFLOPs: 57.64 | 7: iteration 15260/ 44073 | consumed samples: 7813120 | consumed tokens: 16001269760 | elapsed time per iteration (s): 4.16 | learning rate: 1.534E-04 | global batch size: 512 | lm loss: 2.093852E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.188 | TFLOPs: 57.41 | 7: iteration 15270/ 44073 | consumed samples: 7818240 | consumed tokens: 16011755520 | elapsed time per iteration (s): 4.15 | learning rate: 1.534E-04 | global batch size: 512 | lm loss: 2.070876E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.499 | TFLOPs: 57.56 | 7: iteration 15280/ 44073 | consumed samples: 7823360 | consumed tokens: 16022241280 | elapsed time per iteration (s): 4.20 | learning rate: 1.533E-04 | global batch 
size: 512 | lm loss: 2.107723E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.984 | TFLOPs: 56.85 | 7: iteration 15290/ 44073 | consumed samples: 7828480 | consumed tokens: 16032727040 | elapsed time per iteration (s): 4.17 | learning rate: 1.533E-04 | global batch size: 512 | lm loss: 2.089456E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.903 | TFLOPs: 57.28 | 7: iteration 15300/ 44073 | consumed samples: 7833600 | consumed tokens: 16043212800 | elapsed time per iteration (s): 4.14 | learning rate: 1.532E-04 | global batch size: 512 | lm loss: 2.098215E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 | 7: iteration 15310/ 44073 | consumed samples: 7838720 | consumed tokens: 16053698560 | elapsed time per iteration (s): 4.29 | learning rate: 1.532E-04 | global batch size: 512 | lm loss: 2.090899E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.388 | TFLOPs: 55.64 | 7: iteration 15320/ 44073 | consumed samples: 7843840 | consumed tokens: 16064184320 | elapsed time per iteration (s): 4.20 | learning rate: 1.531E-04 | global batch size: 512 | lm loss: 2.070292E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.927 | TFLOPs: 56.82 | 7: iteration 15330/ 44073 | consumed samples: 7848960 | consumed tokens: 16074670080 | elapsed time per iteration (s): 4.16 | learning rate: 1.531E-04 | global batch size: 512 | lm loss: 2.093762E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.194 | TFLOPs: 57.41 | 7: iteration 15340/ 44073 | consumed samples: 7854080 | consumed tokens: 
16085155840 | elapsed time per iteration (s): 4.16 | learning rate: 1.530E-04 | global batch size: 512 | lm loss: 2.118253E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.222 | TFLOPs: 57.43 | 7: iteration 15350/ 44073 | consumed samples: 7859200 | consumed tokens: 16095641600 | elapsed time per iteration (s): 4.16 | learning rate: 1.529E-04 | global batch size: 512 | lm loss: 2.108239E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.992 | TFLOPs: 57.32 | 7: iteration 15360/ 44073 | consumed samples: 7864320 | consumed tokens: 16106127360 | elapsed time per iteration (s): 4.25 | learning rate: 1.529E-04 | global batch size: 512 | lm loss: 2.085032E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.543 | TFLOPs: 56.18 | 7: iteration 15370/ 44073 | consumed samples: 7869440 | consumed tokens: 16116613120 | elapsed time per iteration (s): 4.31 | learning rate: 1.528E-04 | global batch size: 512 | lm loss: 2.081502E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.859 | TFLOPs: 55.39 | 7: iteration 15380/ 44073 | consumed samples: 7874560 | consumed tokens: 16127098880 | elapsed time per iteration (s): 4.15 | learning rate: 1.528E-04 | global batch size: 512 | lm loss: 2.083813E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.288 | TFLOPs: 57.46 | 7: iteration 15390/ 44073 | consumed samples: 7879680 | consumed tokens: 16137584640 | elapsed time per iteration (s): 4.14 | learning rate: 1.527E-04 | global batch size: 512 | lm loss: 2.080190E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.696 | 
TFLOPs: 57.65 | 7: iteration 15400/ 44073 | consumed samples: 7884800 | consumed tokens: 16148070400 | elapsed time per iteration (s): 4.14 | learning rate: 1.527E-04 | global batch size: 512 | lm loss: 2.071273E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.547 | TFLOPs: 57.58 | 7: iteration 15410/ 44073 | consumed samples: 7889920 | consumed tokens: 16158556160 | elapsed time per iteration (s): 4.17 | learning rate: 1.526E-04 | global batch size: 512 | lm loss: 2.078078E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.787 | TFLOPs: 57.22 | 7: iteration 15420/ 44073 | consumed samples: 7895040 | consumed tokens: 16169041920 | elapsed time per iteration (s): 4.18 | learning rate: 1.525E-04 | global batch size: 512 | lm loss: 2.088574E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.433 | TFLOPs: 57.06 | 7: iteration 15430/ 44073 | consumed samples: 7900160 | consumed tokens: 16179527680 | elapsed time per iteration (s): 4.16 | learning rate: 1.525E-04 | global batch size: 512 | lm loss: 2.087593E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.933 | TFLOPs: 57.29 | 7: iteration 15440/ 44073 | consumed samples: 7905280 | consumed tokens: 16190013440 | elapsed time per iteration (s): 4.14 | learning rate: 1.524E-04 | global batch size: 512 | lm loss: 2.088327E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.548 | TFLOPs: 57.58 | 7: iteration 15450/ 44073 | consumed samples: 7910400 | consumed tokens: 16200499200 | elapsed time per iteration (s): 4.14 | learning rate: 1.524E-04 | global batch size: 512 | lm loss: 2.081227E+00 | grad norm: 0.126 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.628 | TFLOPs: 57.62 | 7: iteration 15460/ 44073 | consumed samples: 7915520 | consumed tokens: 16210984960 | elapsed time per iteration (s): 4.22 | learning rate: 1.523E-04 | global batch size: 512 | lm loss: 2.086001E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.217 | TFLOPs: 56.49 | 7: iteration 15470/ 44073 | consumed samples: 7920640 | consumed tokens: 16221470720 | elapsed time per iteration (s): 4.19 | learning rate: 1.523E-04 | global batch size: 512 | lm loss: 2.089326E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.281 | TFLOPs: 56.99 | 7: iteration 15480/ 44073 | consumed samples: 7925760 | consumed tokens: 16231956480 | elapsed time per iteration (s): 4.17 | learning rate: 1.522E-04 | global batch size: 512 | lm loss: 2.086923E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.642 | TFLOPs: 57.16 | 7: iteration 15490/ 44073 | consumed samples: 7930880 | consumed tokens: 16242442240 | elapsed time per iteration (s): 4.17 | learning rate: 1.521E-04 | global batch size: 512 | lm loss: 2.074356E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.821 | TFLOPs: 57.24 | 7: iteration 15500/ 44073 | consumed samples: 7936000 | consumed tokens: 16252928000 | elapsed time per iteration (s): 4.18 | learning rate: 1.521E-04 | global batch size: 512 | lm loss: 2.087604E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.500 | TFLOPs: 57.09 | 7: iteration 15510/ 44073 | consumed samples: 7941120 | consumed tokens: 16263413760 | elapsed time per iteration (s): 4.18 | learning rate: 
1.520E-04 | global batch size: 512 | lm loss: 2.118186E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.388 | TFLOPs: 57.04 | 7: iteration 15520/ 44073 | consumed samples: 7946240 | consumed tokens: 16273899520 | elapsed time per iteration (s): 4.17 | learning rate: 1.520E-04 | global batch size: 512 | lm loss: 2.115223E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.684 | TFLOPs: 57.18 | 7: iteration 15530/ 44073 | consumed samples: 7951360 | consumed tokens: 16284385280 | elapsed time per iteration (s): 4.19 | learning rate: 1.519E-04 | global batch size: 512 | lm loss: 2.078189E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.240 | TFLOPs: 56.97 | 7: iteration 15540/ 44073 | consumed samples: 7956480 | consumed tokens: 16294871040 | elapsed time per iteration (s): 4.15 | learning rate: 1.519E-04 | global batch size: 512 | lm loss: 2.115371E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.416 | TFLOPs: 57.52 | 7: iteration 15550/ 44073 | consumed samples: 7961600 | consumed tokens: 16305356800 | elapsed time per iteration (s): 4.21 | learning rate: 1.518E-04 | global batch size: 512 | lm loss: 2.093099E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.653 | TFLOPs: 56.70 | 7: iteration 15560/ 44073 | consumed samples: 7966720 | consumed tokens: 16315842560 | elapsed time per iteration (s): 4.17 | learning rate: 1.517E-04 | global batch size: 512 | lm loss: 2.089835E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.923 | TFLOPs: 57.29 | 7: iteration 15570/ 44073 | consumed samples: 
7971840 | consumed tokens: 16326328320 | elapsed time per iteration (s): 4.17 | learning rate: 1.517E-04 | global batch size: 512 | lm loss: 2.101535E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.637 | TFLOPs: 57.15 | 7: iteration 15580/ 44073 | consumed samples: 7976960 | consumed tokens: 16336814080 | elapsed time per iteration (s): 4.22 | learning rate: 1.516E-04 | global batch size: 512 | lm loss: 2.075826E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.329 | TFLOPs: 56.55 | 7: iteration 15590/ 44073 | consumed samples: 7982080 | consumed tokens: 16347299840 | elapsed time per iteration (s): 4.17 | learning rate: 1.516E-04 | global batch size: 512 | lm loss: 2.095565E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.848 | TFLOPs: 57.25 | 7: iteration 15600/ 44073 | consumed samples: 7987200 | consumed tokens: 16357785600 | elapsed time per iteration (s): 4.21 | learning rate: 1.515E-04 | global batch size: 512 | lm loss: 2.083793E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.638 | TFLOPs: 56.69 | 7: iteration 15610/ 44073 | consumed samples: 7992320 | consumed tokens: 16368271360 | elapsed time per iteration (s): 4.20 | learning rate: 1.514E-04 | global batch size: 512 | lm loss: 2.091897E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.943 | TFLOPs: 56.83 | 7: iteration 15620/ 44073 | consumed samples: 7997440 | consumed tokens: 16378757120 | elapsed time per iteration (s): 4.17 | learning rate: 1.514E-04 | global batch size: 512 | lm loss: 2.092098E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 122.696 | TFLOPs: 57.18 | 7: iteration 15630/ 44073 | consumed samples: 8002560 | consumed tokens: 16389242880 | elapsed time per iteration (s): 4.23 | learning rate: 1.513E-04 | global batch size: 512 | lm loss: 2.068483E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.174 | TFLOPs: 56.47 | 7: iteration 15640/ 44073 | consumed samples: 8007680 | consumed tokens: 16399728640 | elapsed time per iteration (s): 4.19 | learning rate: 1.513E-04 | global batch size: 512 | lm loss: 2.060982E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.285 | TFLOPs: 56.99 | 7: iteration 15650/ 44073 | consumed samples: 8012800 | consumed tokens: 16410214400 | elapsed time per iteration (s): 4.16 | learning rate: 1.512E-04 | global batch size: 512 | lm loss: 2.075516E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.186 | TFLOPs: 57.41 | 7: iteration 15660/ 44073 | consumed samples: 8017920 | consumed tokens: 16420700160 | elapsed time per iteration (s): 4.21 | learning rate: 1.512E-04 | global batch size: 512 | lm loss: 2.077088E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.641 | TFLOPs: 56.69 | 7: iteration 15670/ 44073 | consumed samples: 8023040 | consumed tokens: 16431185920 | elapsed time per iteration (s): 4.18 | learning rate: 1.511E-04 | global batch size: 512 | lm loss: 2.087893E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.501 | TFLOPs: 57.09 | 7: iteration 15680/ 44073 | consumed samples: 8028160 | consumed tokens: 16441671680 | elapsed time per iteration (s): 4.17 | learning rate: 1.510E-04 | global batch size: 512 | lm loss: 2.089276E+00 | grad norm: 
0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.807 | TFLOPs: 57.23 | 7: iteration 15690/ 44073 | consumed samples: 8033280 | consumed tokens: 16452157440 | elapsed time per iteration (s): 4.16 | learning rate: 1.510E-04 | global batch size: 512 | lm loss: 2.068509E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.082 | TFLOPs: 57.36 | 7: iteration 15700/ 44073 | consumed samples: 8038400 | consumed tokens: 16462643200 | elapsed time per iteration (s): 4.16 | learning rate: 1.509E-04 | global batch size: 512 | lm loss: 2.090847E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.049 | TFLOPs: 57.35 | 7: iteration 15710/ 44073 | consumed samples: 8043520 | consumed tokens: 16473128960 | elapsed time per iteration (s): 4.15 | learning rate: 1.509E-04 | global batch size: 512 | lm loss: 2.076705E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 15720/ 44073 | consumed samples: 8048640 | consumed tokens: 16483614720 | elapsed time per iteration (s): 4.19 | learning rate: 1.508E-04 | global batch size: 512 | lm loss: 2.097689E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.100 | TFLOPs: 56.90 | 7: iteration 15730/ 44073 | consumed samples: 8053760 | consumed tokens: 16494100480 | elapsed time per iteration (s): 4.20 | learning rate: 1.508E-04 | global batch size: 512 | lm loss: 2.088483E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.951 | TFLOPs: 56.84 | 7: iteration 15740/ 44073 | consumed samples: 8058880 | consumed tokens: 16504586240 | elapsed time per iteration (s): 4.17 
| learning rate: 1.507E-04 | global batch size: 512 | lm loss: 2.104487E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.873 | TFLOPs: 57.27 | 7: iteration 15750/ 44073 | consumed samples: 8064000 | consumed tokens: 16515072000 | elapsed time per iteration (s): 4.15 | learning rate: 1.506E-04 | global batch size: 512 | lm loss: 2.074344E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.495 | TFLOPs: 57.56 | 7: iteration 15760/ 44073 | consumed samples: 8069120 | consumed tokens: 16525557760 | elapsed time per iteration (s): 4.14 | learning rate: 1.506E-04 | global batch size: 512 | lm loss: 2.083213E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.745 | TFLOPs: 57.67 | 7: iteration 15770/ 44073 | consumed samples: 8074240 | consumed tokens: 16536043520 | elapsed time per iteration (s): 4.18 | learning rate: 1.505E-04 | global batch size: 512 | lm loss: 2.070879E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.463 | TFLOPs: 57.07 | 7: iteration 15780/ 44073 | consumed samples: 8079360 | consumed tokens: 16546529280 | elapsed time per iteration (s): 4.19 | learning rate: 1.505E-04 | global batch size: 512 | lm loss: 2.087195E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.167 | TFLOPs: 56.94 | 7: iteration 15790/ 44073 | consumed samples: 8084480 | consumed tokens: 16557015040 | elapsed time per iteration (s): 4.15 | learning rate: 1.504E-04 | global batch size: 512 | lm loss: 2.083703E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.268 | TFLOPs: 57.45 | 7: iteration 15800/ 44073 | 
consumed samples: 8089600 | consumed tokens: 16567500800 | elapsed time per iteration (s): 4.19 | learning rate: 1.504E-04 | global batch size: 512 | lm loss: 2.066316E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.082 | TFLOPs: 56.90 | 7: iteration 15810/ 44073 | consumed samples: 8094720 | consumed tokens: 16577986560 | elapsed time per iteration (s): 4.17 | learning rate: 1.503E-04 | global batch size: 512 | lm loss: 2.079001E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.693 | TFLOPs: 57.18 | 7: iteration 15820/ 44073 | consumed samples: 8099840 | consumed tokens: 16588472320 | elapsed time per iteration (s): 4.21 | learning rate: 1.502E-04 | global batch size: 512 | lm loss: 2.086097E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.681 | TFLOPs: 56.71 | 7: iteration 15830/ 44073 | consumed samples: 8104960 | consumed tokens: 16598958080 | elapsed time per iteration (s): 4.19 | learning rate: 1.502E-04 | global batch size: 512 | lm loss: 2.076560E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.242 | TFLOPs: 56.97 | 7: iteration 15840/ 44073 | consumed samples: 8110080 | consumed tokens: 16609443840 | elapsed time per iteration (s): 4.15 | learning rate: 1.501E-04 | global batch size: 512 | lm loss: 2.082891E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.511 | TFLOPs: 57.56 | 7: iteration 15850/ 44073 | consumed samples: 8115200 | consumed tokens: 16619929600 | elapsed time per iteration (s): 4.14 | learning rate: 1.501E-04 | global batch size: 512 | lm loss: 2.077685E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.707 | TFLOPs: 57.65 | 7: iteration 15860/ 44073 | consumed samples: 8120320 | consumed tokens: 16630415360 | elapsed time per iteration (s): 4.14 | learning rate: 1.500E-04 | global batch size: 512 | lm loss: 2.093828E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.762 | TFLOPs: 57.68 | 7: iteration 15870/ 44073 | consumed samples: 8125440 | consumed tokens: 16640901120 | elapsed time per iteration (s): 4.14 | learning rate: 1.499E-04 | global batch size: 512 | lm loss: 2.100145E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.652 | TFLOPs: 57.63 | 7: iteration 15880/ 44073 | consumed samples: 8130560 | consumed tokens: 16651386880 | elapsed time per iteration (s): 4.14 | learning rate: 1.499E-04 | global batch size: 512 | lm loss: 2.090165E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.768 | TFLOPs: 57.68 | 7: iteration 15890/ 44073 | consumed samples: 8135680 | consumed tokens: 16661872640 | elapsed time per iteration (s): 4.14 | learning rate: 1.498E-04 | global batch size: 512 | lm loss: 2.083812E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.655 | TFLOPs: 57.63 | 7: iteration 15900/ 44073 | consumed samples: 8140800 | consumed tokens: 16672358400 | elapsed time per iteration (s): 4.20 | learning rate: 1.498E-04 | global batch size: 512 | lm loss: 2.088859E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.028 | TFLOPs: 56.87 | 7: iteration 15910/ 44073 | consumed samples: 8145920 | consumed tokens: 16682844160 | elapsed time per iteration (s): 4.14 | learning rate: 1.497E-04 | global batch size: 512 | lm loss: 
2.081093E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.747 | TFLOPs: 57.67 | 7: iteration 15920/ 44073 | consumed samples: 8151040 | consumed tokens: 16693329920 | elapsed time per iteration (s): 4.14 | learning rate: 1.497E-04 | global batch size: 512 | lm loss: 2.092651E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.579 | TFLOPs: 57.59 | 7: iteration 15930/ 44073 | consumed samples: 8156160 | consumed tokens: 16703815680 | elapsed time per iteration (s): 4.14 | learning rate: 1.496E-04 | global batch size: 512 | lm loss: 2.068431E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.582 | TFLOPs: 57.60 | 7: iteration 15940/ 44073 | consumed samples: 8161280 | consumed tokens: 16714301440 | elapsed time per iteration (s): 4.19 | learning rate: 1.495E-04 | global batch size: 512 | lm loss: 2.077612E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.160 | TFLOPs: 56.93 | 7: iteration 15950/ 44073 | consumed samples: 8166400 | consumed tokens: 16724787200 | elapsed time per iteration (s): 4.18 | learning rate: 1.495E-04 | global batch size: 512 | lm loss: 2.079257E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.452 | TFLOPs: 57.07 | 7: iteration 15960/ 44073 | consumed samples: 8171520 | consumed tokens: 16735272960 | elapsed time per iteration (s): 4.20 | learning rate: 1.494E-04 | global batch size: 512 | lm loss: 2.070835E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.020 | TFLOPs: 56.87 | 7: iteration 15970/ 44073 | consumed samples: 8176640 | consumed tokens: 16745758720 | elapsed 
time per iteration (s): 4.14 | learning rate: 1.494E-04 | global batch size: 512 | lm loss: 2.088109E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.626 | TFLOPs: 57.62 |
7: iteration 15980/ 44073 | consumed samples: 8181760 | consumed tokens: 16756244480 | elapsed time per iteration (s): 4.16 | learning rate: 1.493E-04 | global batch size: 512 | lm loss: 2.056945E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.151 | TFLOPs: 57.39 |
7: iteration 15990/ 44073 | consumed samples: 8186880 | consumed tokens: 16766730240 | elapsed time per iteration (s): 4.15 | learning rate: 1.492E-04 | global batch size: 512 | lm loss: 2.088416E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.329 | TFLOPs: 57.48 |
0: [2022-11-26 05:08:16,127] [INFO] [logging.py:68:log_dist] [Rank 0] step=16000, skipped=0, lr=[0.00014918906570796453, 0.00014918906570796453, 0.00014918906570796453], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
7: iteration 16000/ 44073 | consumed samples: 8192000 | consumed tokens: 16777216000 | elapsed time per iteration (s): 4.14 | learning rate: 1.492E-04 | global batch size: 512 | lm loss: 2.069855E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 |
0: steps: 16000 loss: 2.0450 iter time (s): 4.177 samples/sec: 122.588
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 16000 | lm loss value: 1.978814E+00 | lm loss PPL: 7.234161E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 16000 to checkpoints_2b2
0: [2022-11-26 05:08:17,483] [INFO] [logging.py:68:log_dist] [Rank 0]
[Torch] Checkpoint global_step16000 is begin to save! 0: [2022-11-26 05:08:17,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:08:17,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:08:17,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:08:17,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:08:17,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:08:18,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:08:18,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:08:18,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:08:18,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:08:18,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:08:18,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:08:18,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 05:08:18,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:08:18,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:08:18,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:08:18,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:08:18,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:08:18,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:08:18,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:08:19,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:08:19,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:08:19,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:08:19,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:08:19,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 05:08:19,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:08:19,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:08:19,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:08:19,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:08:19,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:08:19,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:08:19,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:08:19,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:08:19,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:08:20,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:08:20,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:08:20,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 05:08:20,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:08:20,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:08:20,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:08:20,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:08:20,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:08:20,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:08:20,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:08:20,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:08:20,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:08:20,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:08:20,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:08:21,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 05:08:21,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:08:21,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:08:21,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:08:21,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:08:21,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:08:21,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:08:21,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_29-model_00-model_states.pt... 0: [2022-11-26 05:08:21,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_29-model_00-model_states.pt. 0: [2022-11-26 05:08:21,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:08:21,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:08:21,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_31-model_00-model_states.pt... 0: [2022-11-26 05:08:21,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_31-model_00-model_states.pt. 
0: [2022-11-26 05:08:21,920] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_32-model_00-model_states.pt... 0: [2022-11-26 05:08:22,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_32-model_00-model_states.pt. 0: [2022-11-26 05:08:22,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_33-model_00-model_states.pt... 0: [2022-11-26 05:08:22,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_33-model_00-model_states.pt. 0: [2022-11-26 05:08:22,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_34-model_00-model_states.pt... 0: [2022-11-26 05:08:22,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_34-model_00-model_states.pt. 0: [2022-11-26 05:08:22,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/layer_36-model_00-model_states.pt... 0: [2022-11-26 05:08:22,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/layer_36-model_00-model_states.pt. 0: [2022-11-26 05:08:22,337] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step16000/mp_rank_00_model_states.pt 0: [2022-11-26 05:08:22,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:08:22,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:08:22,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:08:22,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:08:22,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:08:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:08:22,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:08:22,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:08:22,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:08:22,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:08:22,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:08:22,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 05:08:22,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:08:22,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 05:08:22,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:08:22,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:08:22,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:08:23,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:08:23,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:08:23,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:08:23,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:08:23,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:08:23,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:08:23,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:08:23,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 05:08:23,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:08:23,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:08:23,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:08:23,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:08:23,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:08:23,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:08:23,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 05:08:23,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:08:23,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:08:23,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 05:08:23,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 05:08:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
4: [2022-11-26 05:08:23,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:08:23,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:08:23,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: successfully saved checkpoint at iteration 16000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6216.53 7: iteration 16010/ 44073 | consumed samples: 8197120 | consumed tokens: 16787701760 | elapsed time per iteration (s): 4.89 | learning rate: 1.491E-04 | global batch size: 512 | lm loss: 2.069197E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.697 | TFLOPs: 48.79 | 7: iteration 16020/ 44073 | consumed samples: 8202240 | consumed tokens: 16798187520 | elapsed time per iteration (s): 5.79 | learning rate: 1.491E-04 | global batch size: 512 | lm loss: 2.096666E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 88.364 | TFLOPs: 41.18 | 7: iteration 16030/ 44073 | consumed samples: 8207360 | consumed tokens: 16808673280 | elapsed time per iteration (s): 4.15 | learning rate: 1.490E-04 | global batch size: 512 | lm loss: 2.089955E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.427 | TFLOPs: 57.52 | 7: iteration 16040/ 44073 | consumed samples: 8212480 | consumed tokens: 16819159040 | elapsed time per iteration (s): 4.16 | learning rate: 1.490E-04 | global batch size: 512 | lm loss: 2.092699E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.159 | TFLOPs: 57.40 | 7: iteration 16050/ 
44073 | consumed samples: 8217600 | consumed tokens: 16829644800 | elapsed time per iteration (s): 4.21 | learning rate: 1.489E-04 | global batch size: 512 | lm loss: 2.093180E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.669 | TFLOPs: 56.70 | 7: iteration 16060/ 44073 | consumed samples: 8222720 | consumed tokens: 16840130560 | elapsed time per iteration (s): 4.15 | learning rate: 1.488E-04 | global batch size: 512 | lm loss: 2.078085E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 16070/ 44073 | consumed samples: 8227840 | consumed tokens: 16850616320 | elapsed time per iteration (s): 4.21 | learning rate: 1.488E-04 | global batch size: 512 | lm loss: 2.084764E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.602 | TFLOPs: 56.67 | 7: iteration 16080/ 44073 | consumed samples: 8232960 | consumed tokens: 16861102080 | elapsed time per iteration (s): 4.16 | learning rate: 1.487E-04 | global batch size: 512 | lm loss: 2.085165E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.063 | TFLOPs: 57.35 | 7: iteration 16090/ 44073 | consumed samples: 8238080 | consumed tokens: 16871587840 | elapsed time per iteration (s): 4.20 | learning rate: 1.487E-04 | global batch size: 512 | lm loss: 2.069855E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.877 | TFLOPs: 56.80 | 7: iteration 16100/ 44073 | consumed samples: 8243200 | consumed tokens: 16882073600 | elapsed time per iteration (s): 4.17 | learning rate: 1.486E-04 | global batch size: 512 | lm loss: 2.085733E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 122.667 | TFLOPs: 57.17 | 7: iteration 16110/ 44073 | consumed samples: 8248320 | consumed tokens: 16892559360 | elapsed time per iteration (s): 4.17 | learning rate: 1.485E-04 | global batch size: 512 | lm loss: 2.077919E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.743 | TFLOPs: 57.20 | 7: iteration 16120/ 44073 | consumed samples: 8253440 | consumed tokens: 16903045120 | elapsed time per iteration (s): 4.15 | learning rate: 1.485E-04 | global batch size: 512 | lm loss: 2.082611E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.488 | TFLOPs: 57.55 | 7: iteration 16130/ 44073 | consumed samples: 8258560 | consumed tokens: 16913530880 | elapsed time per iteration (s): 4.13 | learning rate: 1.484E-04 | global batch size: 512 | lm loss: 2.120234E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.873 | TFLOPs: 57.73 | 7: iteration 16140/ 44073 | consumed samples: 8263680 | consumed tokens: 16924016640 | elapsed time per iteration (s): 4.13 | learning rate: 1.484E-04 | global batch size: 512 | lm loss: 2.076675E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.868 | TFLOPs: 57.73 | 7: iteration 16150/ 44073 | consumed samples: 8268800 | consumed tokens: 16934502400 | elapsed time per iteration (s): 4.18 | learning rate: 1.483E-04 | global batch size: 512 | lm loss: 2.089164E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.538 | TFLOPs: 57.11 | 7: iteration 16160/ 44073 | consumed samples: 8273920 | consumed tokens: 16944988160 | elapsed time per iteration (s): 4.20 | learning rate: 1.483E-04 | global batch size: 512 | lm loss: 
2.093293E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.766 | TFLOPs: 56.75 | 7: iteration 16170/ 44073 | consumed samples: 8279040 | consumed tokens: 16955473920 | elapsed time per iteration (s): 4.26 | learning rate: 1.482E-04 | global batch size: 512 | lm loss: 2.103035E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.172 | TFLOPs: 56.01 | 7: iteration 16180/ 44073 | consumed samples: 8284160 | consumed tokens: 16965959680 | elapsed time per iteration (s): 4.14 | learning rate: 1.481E-04 | global batch size: 512 | lm loss: 2.084515E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.529 | TFLOPs: 57.57 | 7: iteration 16190/ 44073 | consumed samples: 8289280 | consumed tokens: 16976445440 | elapsed time per iteration (s): 4.18 | learning rate: 1.481E-04 | global batch size: 512 | lm loss: 2.076620E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.442 | TFLOPs: 57.06 | 7: iteration 16200/ 44073 | consumed samples: 8294400 | consumed tokens: 16986931200 | elapsed time per iteration (s): 4.14 | learning rate: 1.480E-04 | global batch size: 512 | lm loss: 2.088320E+00 | grad norm: 5.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.712 | TFLOPs: 57.66 | 7: iteration 16210/ 44073 | consumed samples: 8299520 | consumed tokens: 16997416960 | elapsed time per iteration (s): 4.15 | learning rate: 1.480E-04 | global batch size: 512 | lm loss: 2.100670E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.257 | TFLOPs: 57.44 | 7: iteration 16220/ 44073 | consumed samples: 8304640 | consumed tokens: 17007902720 | elapsed 
time per iteration (s): 4.16 | learning rate: 1.479E-04 | global batch size: 512 | lm loss: 2.099041E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.095 | TFLOPs: 57.37 | 7: iteration 16230/ 44073 | consumed samples: 8309760 | consumed tokens: 17018388480 | elapsed time per iteration (s): 4.28 | learning rate: 1.478E-04 | global batch size: 512 | lm loss: 2.085999E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.495 | TFLOPs: 55.69 | 7: iteration 16240/ 44073 | consumed samples: 8314880 | consumed tokens: 17028874240 | elapsed time per iteration (s): 4.16 | learning rate: 1.478E-04 | global batch size: 512 | lm loss: 2.092932E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.128 | TFLOPs: 57.38 | 7: iteration 16250/ 44073 | consumed samples: 8320000 | consumed tokens: 17039360000 | elapsed time per iteration (s): 4.19 | learning rate: 1.477E-04 | global batch size: 512 | lm loss: 2.053453E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.158 | TFLOPs: 56.93 | 7: iteration 16260/ 44073 | consumed samples: 8325120 | consumed tokens: 17049845760 | elapsed time per iteration (s): 4.19 | learning rate: 1.477E-04 | global batch size: 512 | lm loss: 2.074416E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.134 | TFLOPs: 56.92 | 7: iteration 16270/ 44073 | consumed samples: 8330240 | consumed tokens: 17060331520 | elapsed time per iteration (s): 4.18 | learning rate: 1.476E-04 | global batch size: 512 | lm loss: 2.068634E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.541 | TFLOPs: 57.11 | 7: 
iteration 16280/ 44073 | consumed samples: 8335360 | consumed tokens: 17070817280 | elapsed time per iteration (s): 4.21 | learning rate: 1.475E-04 | global batch size: 512 | lm loss: 2.092704E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.689 | TFLOPs: 56.71 | 7: iteration 16290/ 44073 | consumed samples: 8340480 | consumed tokens: 17081303040 | elapsed time per iteration (s): 4.26 | learning rate: 1.475E-04 | global batch size: 512 | lm loss: 2.068752E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.274 | TFLOPs: 56.05 | 7: iteration 16300/ 44073 | consumed samples: 8345600 | consumed tokens: 17091788800 | elapsed time per iteration (s): 4.27 | learning rate: 1.474E-04 | global batch size: 512 | lm loss: 2.080350E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.963 | TFLOPs: 55.91 | 7: iteration 16310/ 44073 | consumed samples: 8350720 | consumed tokens: 17102274560 | elapsed time per iteration (s): 4.17 | learning rate: 1.474E-04 | global batch size: 512 | lm loss: 2.061720E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.895 | TFLOPs: 57.28 | 7: iteration 16320/ 44073 | consumed samples: 8355840 | consumed tokens: 17112760320 | elapsed time per iteration (s): 4.19 | learning rate: 1.473E-04 | global batch size: 512 | lm loss: 2.064189E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.102 | TFLOPs: 56.91 | 7: iteration 16330/ 44073 | consumed samples: 8360960 | consumed tokens: 17123246080 | elapsed time per iteration (s): 4.19 | learning rate: 1.473E-04 | global batch size: 512 | lm loss: 2.046288E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.054 | TFLOPs: 56.88 | 7: iteration 16340/ 44073 | consumed samples: 8366080 | consumed tokens: 17133731840 | elapsed time per iteration (s): 4.23 | learning rate: 1.472E-04 | global batch size: 512 | lm loss: 2.086109E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.939 | TFLOPs: 56.36 | 7: iteration 16350/ 44073 | consumed samples: 8371200 | consumed tokens: 17144217600 | elapsed time per iteration (s): 4.25 | learning rate: 1.471E-04 | global batch size: 512 | lm loss: 2.069810E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.425 | TFLOPs: 56.12 | 7: iteration 16360/ 44073 | consumed samples: 8376320 | consumed tokens: 17154703360 | elapsed time per iteration (s): 4.15 | learning rate: 1.471E-04 | global batch size: 512 | lm loss: 2.066676E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.390 | TFLOPs: 57.51 | 7: iteration 16370/ 44073 | consumed samples: 8381440 | consumed tokens: 17165189120 | elapsed time per iteration (s): 4.16 | learning rate: 1.470E-04 | global batch size: 512 | lm loss: 2.072311E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.185 | TFLOPs: 57.41 | 7: iteration 16380/ 44073 | consumed samples: 8386560 | consumed tokens: 17175674880 | elapsed time per iteration (s): 4.15 | learning rate: 1.470E-04 | global batch size: 512 | lm loss: 2.082036E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.440 | TFLOPs: 57.53 | 7: iteration 16390/ 44073 | consumed samples: 8391680 | consumed tokens: 17186160640 | elapsed time per iteration (s): 4.20 | learning rate: 1.469E-04 | global batch 
size: 512 | lm loss: 2.073886E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.926 | TFLOPs: 56.82 | 7: iteration 16400/ 44073 | consumed samples: 8396800 | consumed tokens: 17196646400 | elapsed time per iteration (s): 4.18 | learning rate: 1.468E-04 | global batch size: 512 | lm loss: 2.077003E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.371 | TFLOPs: 57.03 | 7: iteration 16410/ 44073 | consumed samples: 8401920 | consumed tokens: 17207132160 | elapsed time per iteration (s): 4.17 | learning rate: 1.468E-04 | global batch size: 512 | lm loss: 2.092652E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.767 | TFLOPs: 57.22 | 7: iteration 16420/ 44073 | consumed samples: 8407040 | consumed tokens: 17217617920 | elapsed time per iteration (s): 4.17 | learning rate: 1.467E-04 | global batch size: 512 | lm loss: 2.100924E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.720 | TFLOPs: 57.19 | 7: iteration 16430/ 44073 | consumed samples: 8412160 | consumed tokens: 17228103680 | elapsed time per iteration (s): 4.20 | learning rate: 1.467E-04 | global batch size: 512 | lm loss: 2.072456E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.857 | TFLOPs: 56.79 | 7: iteration 16440/ 44073 | consumed samples: 8417280 | consumed tokens: 17238589440 | elapsed time per iteration (s): 4.19 | learning rate: 1.466E-04 | global batch size: 512 | lm loss: 2.069982E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.317 | TFLOPs: 57.01 | 7: iteration 16450/ 44073 | consumed samples: 8422400 | consumed tokens: 
17249075200 | elapsed time per iteration (s): 4.17 | learning rate: 1.465E-04 | global batch size: 512 | lm loss: 2.063978E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.726 | TFLOPs: 57.20 | 7: iteration 16460/ 44073 | consumed samples: 8427520 | consumed tokens: 17259560960 | elapsed time per iteration (s): 4.19 | learning rate: 1.465E-04 | global batch size: 512 | lm loss: 2.064734E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.316 | TFLOPs: 57.01 | 7: iteration 16470/ 44073 | consumed samples: 8432640 | consumed tokens: 17270046720 | elapsed time per iteration (s): 4.23 | learning rate: 1.464E-04 | global batch size: 512 | lm loss: 2.104480E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.166 | TFLOPs: 56.47 | 7: iteration 16480/ 44073 | consumed samples: 8437760 | consumed tokens: 17280532480 | elapsed time per iteration (s): 4.20 | learning rate: 1.464E-04 | global batch size: 512 | lm loss: 2.085657E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.933 | TFLOPs: 56.83 | 7: iteration 16490/ 44073 | consumed samples: 8442880 | consumed tokens: 17291018240 | elapsed time per iteration (s): 4.18 | learning rate: 1.463E-04 | global batch size: 512 | lm loss: 2.091505E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.370 | TFLOPs: 57.03 | 7: iteration 16500/ 44073 | consumed samples: 8448000 | consumed tokens: 17301504000 | elapsed time per iteration (s): 4.20 | learning rate: 1.462E-04 | global batch size: 512 | lm loss: 2.079112E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.007 | 
TFLOPs: 56.86 | 7: iteration 16510/ 44073 | consumed samples: 8453120 | consumed tokens: 17311989760 | elapsed time per iteration (s): 4.18 | learning rate: 1.462E-04 | global batch size: 512 | lm loss: 2.092097E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.536 | TFLOPs: 57.11 | 7: iteration 16520/ 44073 | consumed samples: 8458240 | consumed tokens: 17322475520 | elapsed time per iteration (s): 4.14 | learning rate: 1.461E-04 | global batch size: 512 | lm loss: 2.067221E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 | 7: iteration 16530/ 44073 | consumed samples: 8463360 | consumed tokens: 17332961280 | elapsed time per iteration (s): 4.14 | learning rate: 1.461E-04 | global batch size: 512 | lm loss: 2.056372E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.686 | TFLOPs: 57.64 | 7: iteration 16540/ 44073 | consumed samples: 8468480 | consumed tokens: 17343447040 | elapsed time per iteration (s): 4.15 | learning rate: 1.460E-04 | global batch size: 512 | lm loss: 2.055270E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.386 | TFLOPs: 57.50 | 7: iteration 16550/ 44073 | consumed samples: 8473600 | consumed tokens: 17353932800 | elapsed time per iteration (s): 4.14 | learning rate: 1.460E-04 | global batch size: 512 | lm loss: 2.075956E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.567 | TFLOPs: 57.59 | 7: iteration 16560/ 44073 | consumed samples: 8478720 | consumed tokens: 17364418560 | elapsed time per iteration (s): 4.15 | learning rate: 1.459E-04 | global batch size: 512 | lm loss: 2.056635E+00 | grad norm: 0.132 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.245 | TFLOPs: 57.44 | 7: iteration 16570/ 44073 | consumed samples: 8483840 | consumed tokens: 17374904320 | elapsed time per iteration (s): 4.14 | learning rate: 1.458E-04 | global batch size: 512 | lm loss: 2.051942E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.715 | TFLOPs: 57.66 | 7: iteration 16580/ 44073 | consumed samples: 8488960 | consumed tokens: 17385390080 | elapsed time per iteration (s): 4.16 | learning rate: 1.458E-04 | global batch size: 512 | lm loss: 2.070963E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.125 | TFLOPs: 57.38 | 7: iteration 16590/ 44073 | consumed samples: 8494080 | consumed tokens: 17395875840 | elapsed time per iteration (s): 4.15 | learning rate: 1.457E-04 | global batch size: 512 | lm loss: 2.073227E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.478 | TFLOPs: 57.55 | 7: iteration 16600/ 44073 | consumed samples: 8499200 | consumed tokens: 17406361600 | elapsed time per iteration (s): 4.14 | learning rate: 1.457E-04 | global batch size: 512 | lm loss: 2.068513E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.781 | TFLOPs: 57.69 | 7: iteration 16610/ 44073 | consumed samples: 8504320 | consumed tokens: 17416847360 | elapsed time per iteration (s): 4.15 | learning rate: 1.456E-04 | global batch size: 512 | lm loss: 2.076322E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.239 | TFLOPs: 57.44 | 7: iteration 16620/ 44073 | consumed samples: 8509440 | consumed tokens: 17427333120 | elapsed time per iteration (s): 4.15 | learning rate: 
1.455E-04 | global batch size: 512 | lm loss: 2.081173E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.325 | TFLOPs: 57.48 | 7: iteration 16630/ 44073 | consumed samples: 8514560 | consumed tokens: 17437818880 | elapsed time per iteration (s): 4.14 | learning rate: 1.455E-04 | global batch size: 512 | lm loss: 2.058570E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.733 | TFLOPs: 57.67 | 7: iteration 16640/ 44073 | consumed samples: 8519680 | consumed tokens: 17448304640 | elapsed time per iteration (s): 4.16 | learning rate: 1.454E-04 | global batch size: 512 | lm loss: 2.109013E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.100 | TFLOPs: 57.37 | 7: iteration 16650/ 44073 | consumed samples: 8524800 | consumed tokens: 17458790400 | elapsed time per iteration (s): 4.15 | learning rate: 1.454E-04 | global batch size: 512 | lm loss: 2.070741E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.388 | TFLOPs: 57.50 | 7: iteration 16660/ 44073 | consumed samples: 8529920 | consumed tokens: 17469276160 | elapsed time per iteration (s): 4.14 | learning rate: 1.453E-04 | global batch size: 512 | lm loss: 2.106706E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.705 | TFLOPs: 57.65 | 7: iteration 16670/ 44073 | consumed samples: 8535040 | consumed tokens: 17479761920 | elapsed time per iteration (s): 4.14 | learning rate: 1.452E-04 | global batch size: 512 | lm loss: 2.079339E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.725 | TFLOPs: 57.66 | 7: iteration 16680/ 44073 | consumed samples: 
8540160 | consumed tokens: 17490247680 | elapsed time per iteration (s): 4.20 | learning rate: 1.452E-04 | global batch size: 512 | lm loss: 2.059837E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.941 | TFLOPs: 56.83 | 7: iteration 16690/ 44073 | consumed samples: 8545280 | consumed tokens: 17500733440 | elapsed time per iteration (s): 4.15 | learning rate: 1.451E-04 | global batch size: 512 | lm loss: 2.086169E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.464 | TFLOPs: 57.54 | 7: iteration 16700/ 44073 | consumed samples: 8550400 | consumed tokens: 17511219200 | elapsed time per iteration (s): 4.15 | learning rate: 1.451E-04 | global batch size: 512 | lm loss: 2.080857E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.506 | TFLOPs: 57.56 | 7: iteration 16710/ 44073 | consumed samples: 8555520 | consumed tokens: 17521704960 | elapsed time per iteration (s): 4.13 | learning rate: 1.450E-04 | global batch size: 512 | lm loss: 2.064622E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.880 | TFLOPs: 57.73 | 7: iteration 16720/ 44073 | consumed samples: 8560640 | consumed tokens: 17532190720 | elapsed time per iteration (s): 4.16 | learning rate: 1.449E-04 | global batch size: 512 | lm loss: 2.060202E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.971 | TFLOPs: 57.31 | 7: iteration 16730/ 44073 | consumed samples: 8565760 | consumed tokens: 17542676480 | elapsed time per iteration (s): 4.14 | learning rate: 1.449E-04 | global batch size: 512 | lm loss: 2.099085E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 123.727 | TFLOPs: 57.66 | 7: iteration 16740/ 44073 | consumed samples: 8570880 | consumed tokens: 17553162240 | elapsed time per iteration (s): 4.14 | learning rate: 1.448E-04 | global batch size: 512 | lm loss: 2.068093E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.733 | TFLOPs: 57.67 | 7: iteration 16750/ 44073 | consumed samples: 8576000 | consumed tokens: 17563648000 | elapsed time per iteration (s): 4.15 | learning rate: 1.448E-04 | global batch size: 512 | lm loss: 2.075967E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.463 | TFLOPs: 57.54 | 7: iteration 16760/ 44073 | consumed samples: 8581120 | consumed tokens: 17574133760 | elapsed time per iteration (s): 4.14 | learning rate: 1.447E-04 | global batch size: 512 | lm loss: 2.067370E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.544 | TFLOPs: 57.58 | 7: iteration 16770/ 44073 | consumed samples: 8586240 | consumed tokens: 17584619520 | elapsed time per iteration (s): 4.13 | learning rate: 1.446E-04 | global batch size: 512 | lm loss: 2.069118E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.865 | TFLOPs: 57.73 | 7: iteration 16780/ 44073 | consumed samples: 8591360 | consumed tokens: 17595105280 | elapsed time per iteration (s): 4.15 | learning rate: 1.446E-04 | global batch size: 512 | lm loss: 2.077406E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.478 | TFLOPs: 57.55 | 7: iteration 16790/ 44073 | consumed samples: 8596480 | consumed tokens: 17605591040 | elapsed time per iteration (s): 4.16 | learning rate: 1.445E-04 | global batch size: 512 | lm loss: 2.077912E+00 | grad norm: 
0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.223 | TFLOPs: 57.43 | 7: iteration 16800/ 44073 | consumed samples: 8601600 | consumed tokens: 17616076800 | elapsed time per iteration (s): 4.18 | learning rate: 1.445E-04 | global batch size: 512 | lm loss: 2.071704E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.536 | TFLOPs: 57.11 | 7: iteration 16810/ 44073 | consumed samples: 8606720 | consumed tokens: 17626562560 | elapsed time per iteration (s): 4.14 | learning rate: 1.444E-04 | global batch size: 512 | lm loss: 2.071149E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.794 | TFLOPs: 57.69 | 7: iteration 16820/ 44073 | consumed samples: 8611840 | consumed tokens: 17637048320 | elapsed time per iteration (s): 4.13 | learning rate: 1.443E-04 | global batch size: 512 | lm loss: 2.072661E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.857 | TFLOPs: 57.72 | 7: iteration 16830/ 44073 | consumed samples: 8616960 | consumed tokens: 17647534080 | elapsed time per iteration (s): 4.14 | learning rate: 1.443E-04 | global batch size: 512 | lm loss: 2.077473E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.774 | TFLOPs: 57.68 | 7: iteration 16840/ 44073 | consumed samples: 8622080 | consumed tokens: 17658019840 | elapsed time per iteration (s): 4.15 | learning rate: 1.442E-04 | global batch size: 512 | lm loss: 2.061225E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.417 | TFLOPs: 57.52 | 7: iteration 16850/ 44073 | consumed samples: 8627200 | consumed tokens: 17668505600 | elapsed time per iteration (s): 4.14 
| learning rate: 1.442E-04 | global batch size: 512 | lm loss: 2.076212E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 | 7: iteration 16860/ 44073 | consumed samples: 8632320 | consumed tokens: 17678991360 | elapsed time per iteration (s): 4.15 | learning rate: 1.441E-04 | global batch size: 512 | lm loss: 2.086522E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.444 | TFLOPs: 57.53 | 7: iteration 16870/ 44073 | consumed samples: 8637440 | consumed tokens: 17689477120 | elapsed time per iteration (s): 4.14 | learning rate: 1.440E-04 | global batch size: 512 | lm loss: 2.086348E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.669 | TFLOPs: 57.64 | 7: iteration 16880/ 44073 | consumed samples: 8642560 | consumed tokens: 17699962880 | elapsed time per iteration (s): 4.15 | learning rate: 1.440E-04 | global batch size: 512 | lm loss: 2.097546E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.355 | TFLOPs: 57.49 | 7: iteration 16890/ 44073 | consumed samples: 8647680 | consumed tokens: 17710448640 | elapsed time per iteration (s): 4.16 | learning rate: 1.439E-04 | global batch size: 512 | lm loss: 2.073972E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.202 | TFLOPs: 57.42 | 7: iteration 16900/ 44073 | consumed samples: 8652800 | consumed tokens: 17720934400 | elapsed time per iteration (s): 4.14 | learning rate: 1.439E-04 | global batch size: 512 | lm loss: 2.082137E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.601 | TFLOPs: 57.60 | 7: iteration 16910/ 44073 | 
consumed samples: 8657920 | consumed tokens: 17731420160 | elapsed time per iteration (s): 4.14 | learning rate: 1.438E-04 | global batch size: 512 | lm loss: 2.086074E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.658 | TFLOPs: 57.63 | 7: iteration 16920/ 44073 | consumed samples: 8663040 | consumed tokens: 17741905920 | elapsed time per iteration (s): 4.14 | learning rate: 1.437E-04 | global batch size: 512 | lm loss: 2.103411E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.664 | TFLOPs: 57.63 | 7: iteration 16930/ 44073 | consumed samples: 8668160 | consumed tokens: 17752391680 | elapsed time per iteration (s): 4.14 | learning rate: 1.437E-04 | global batch size: 512 | lm loss: 2.085984E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.659 | TFLOPs: 57.63 | 7: iteration 16940/ 44073 | consumed samples: 8673280 | consumed tokens: 17762877440 | elapsed time per iteration (s): 4.23 | learning rate: 1.436E-04 | global batch size: 512 | lm loss: 2.066550E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.031 | TFLOPs: 56.41 | 7: iteration 16950/ 44073 | consumed samples: 8678400 | consumed tokens: 17773363200 | elapsed time per iteration (s): 4.30 | learning rate: 1.436E-04 | global batch size: 512 | lm loss: 2.067971E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.087 | TFLOPs: 55.50 | 7: iteration 16960/ 44073 | consumed samples: 8683520 | consumed tokens: 17783848960 | elapsed time per iteration (s): 4.14 | learning rate: 1.435E-04 | global batch size: 512 | lm loss: 2.082032E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.678 | TFLOPs: 57.64 | 7: iteration 16970/ 44073 | consumed samples: 8688640 | consumed tokens: 17794334720 | elapsed time per iteration (s): 4.13 | learning rate: 1.434E-04 | global batch size: 512 | lm loss: 2.061027E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.871 | TFLOPs: 57.73 | 7: iteration 16980/ 44073 | consumed samples: 8693760 | consumed tokens: 17804820480 | elapsed time per iteration (s): 4.16 | learning rate: 1.434E-04 | global batch size: 512 | lm loss: 2.074516E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.066 | TFLOPs: 57.36 | 7: iteration 16990/ 44073 | consumed samples: 8698880 | consumed tokens: 17815306240 | elapsed time per iteration (s): 4.15 | learning rate: 1.433E-04 | global batch size: 512 | lm loss: 2.059802E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.293 | TFLOPs: 57.46 | 7: iteration 17000/ 44073 | consumed samples: 8704000 | consumed tokens: 17825792000 | elapsed time per iteration (s): 4.14 | learning rate: 1.433E-04 | global batch size: 512 | lm loss: 2.084096E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.609 | TFLOPs: 57.61 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 17000 | lm loss value: 2.031860E+00 | lm loss PPL: 7.628264E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 17000 to checkpoints_2b2 0: [2022-11-26 06:18:09,559] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step17000 is begin to save! 
0: [2022-11-26 06:18:09,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_01-model_00-model_states.pt... 0: [2022-11-26 06:18:09,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:18:10,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:18:10,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:18:10,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:18:10,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:18:10,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:18:10,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:18:10,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:18:10,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:18:10,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:18:10,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 06:18:10,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:18:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:18:10,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:18:10,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:18:10,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:18:11,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:18:11,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:18:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:18:11,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:18:11,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:18:11,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:18:11,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 06:18:11,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:18:11,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:18:11,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:18:11,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:18:11,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:18:11,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:18:11,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:18:12,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:18:12,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:18:12,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:18:12,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:18:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 06:18:12,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:18:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:18:12,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:18:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:18:12,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:18:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:18:12,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:18:12,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:18:12,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:18:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:18:12,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:18:13,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 06:18:13,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:18:13,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:18:13,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:18:13,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:18:13,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:18:13,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:18:13,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_29-model_00-model_states.pt... 0: [2022-11-26 06:18:13,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_29-model_00-model_states.pt. 0: [2022-11-26 06:18:13,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:18:13,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:18:13,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_31-model_00-model_states.pt... 0: [2022-11-26 06:18:13,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_31-model_00-model_states.pt. 
0: [2022-11-26 06:18:13,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_32-model_00-model_states.pt... 0: [2022-11-26 06:18:14,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_32-model_00-model_states.pt. 0: [2022-11-26 06:18:14,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_33-model_00-model_states.pt... 0: [2022-11-26 06:18:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_33-model_00-model_states.pt. 0: [2022-11-26 06:18:14,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_34-model_00-model_states.pt... 0: [2022-11-26 06:18:14,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_34-model_00-model_states.pt. 0: [2022-11-26 06:18:14,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/layer_36-model_00-model_states.pt... 0: [2022-11-26 06:18:14,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/layer_36-model_00-model_states.pt. 0: [2022-11-26 06:18:14,397] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step17000/mp_rank_00_model_states.pt 0: [2022-11-26 06:18:14,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:18:14,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:18:14,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:18:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
0: [2022-11-26 06:18:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 06:18:15,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 06:18:15,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 06:18:15,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:18:15,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 06:18:15,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 06:18:15,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:18:15,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
1: [2022-11-26 06:18:15,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 06:18:15,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:18:15,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:18:15,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
4: [2022-11-26 06:18:15,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:18:15,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 06:18:15,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:18:15,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:18:15,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:18:15,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:18:15,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:18:15,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:18:15,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:18:15,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 06:18:15,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:18:15,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 06:18:15,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:18:15,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:18:15,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:18:15,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 06:18:15,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:18:15,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:18:15,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
0: [2022-11-26 06:18:15,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:18:15,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-26 06:18:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 06:18:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: successfully saved checkpoint at iteration 17000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6313.15 7: iteration 17010/ 44073 | consumed samples: 8709120 | consumed tokens: 17836277760 | elapsed time per iteration (s): 4.93 | learning rate: 1.432E-04 | global batch size: 512 | lm loss: 2.086681E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.758 | TFLOPs: 48.36 | 7: iteration 17020/ 44073 | consumed samples: 8714240 | consumed tokens: 17846763520 | elapsed time per iteration (s): 4.15 | learning rate: 1.431E-04 | global batch size: 512 | lm loss: 2.076559E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.494 | TFLOPs: 57.55 | 7: iteration 17030/ 44073 | consumed samples: 8719360 | consumed tokens: 17857249280 | elapsed time per iteration (s): 4.14 | learning rate: 1.431E-04 | global batch size: 512 | lm loss: 2.057426E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.574 | TFLOPs: 57.59 | 7: iteration 17040/ 44073 | consumed samples: 8724480 | consumed tokens: 17867735040 | elapsed time per iteration (s): 4.16 | learning rate: 1.430E-04 | global batch size: 512 | lm loss: 2.071498E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 122.965 | TFLOPs: 57.31 | 7: iteration 17050/ 44073 | consumed samples: 8729600 | consumed tokens: 17878220800 | elapsed time per iteration (s): 4.14 | learning rate: 1.430E-04 | global batch size: 512 | lm loss: 2.078176E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.743 | TFLOPs: 57.67 | 7: iteration 17060/ 44073 | consumed samples: 8734720 | consumed tokens: 17888706560 | elapsed time per iteration (s): 4.14 | learning rate: 1.429E-04 | global batch size: 512 | lm loss: 2.064818E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.760 | TFLOPs: 57.68 | 7: iteration 17070/ 44073 | consumed samples: 8739840 | consumed tokens: 17899192320 | elapsed time per iteration (s): 4.14 | learning rate: 1.428E-04 | global batch size: 512 | lm loss: 2.067929E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.545 | TFLOPs: 57.58 | 7: iteration 17080/ 44073 | consumed samples: 8744960 | consumed tokens: 17909678080 | elapsed time per iteration (s): 4.20 | learning rate: 1.428E-04 | global batch size: 512 | lm loss: 2.064266E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.797 | TFLOPs: 56.76 | 7: iteration 17090/ 44073 | consumed samples: 8750080 | consumed tokens: 17920163840 | elapsed time per iteration (s): 4.16 | learning rate: 1.427E-04 | global batch size: 512 | lm loss: 2.086562E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.104 | TFLOPs: 57.37 | 7: iteration 17100/ 44073 | consumed samples: 8755200 | consumed tokens: 17930649600 | elapsed time per iteration (s): 4.20 | learning rate: 1.427E-04 | global batch size: 512 | lm loss: 
2.091684E+00 | grad norm: 0.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.870 | TFLOPs: 56.80 | 7: iteration 17110/ 44073 | consumed samples: 8760320 | consumed tokens: 17941135360 | elapsed time per iteration (s): 4.14 | learning rate: 1.426E-04 | global batch size: 512 | lm loss: 2.098496E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.795 | TFLOPs: 57.69 | 7: iteration 17120/ 44073 | consumed samples: 8765440 | consumed tokens: 17951621120 | elapsed time per iteration (s): 4.15 | learning rate: 1.425E-04 | global batch size: 512 | lm loss: 2.066230E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.354 | TFLOPs: 57.49 | 7: iteration 17130/ 44073 | consumed samples: 8770560 | consumed tokens: 17962106880 | elapsed time per iteration (s): 4.22 | learning rate: 1.425E-04 | global batch size: 512 | lm loss: 2.077411E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.337 | TFLOPs: 56.55 | 7: iteration 17140/ 44073 | consumed samples: 8775680 | consumed tokens: 17972592640 | elapsed time per iteration (s): 4.20 | learning rate: 1.424E-04 | global batch size: 512 | lm loss: 2.090733E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.942 | TFLOPs: 56.83 | 7: iteration 17150/ 44073 | consumed samples: 8780800 | consumed tokens: 17983078400 | elapsed time per iteration (s): 4.16 | learning rate: 1.424E-04 | global batch size: 512 | lm loss: 2.097130E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.073 | TFLOPs: 57.36 | 7: iteration 17160/ 44073 | consumed samples: 8785920 | consumed tokens: 17993564160 | elapsed 
time per iteration (s): 4.15 | learning rate: 1.423E-04 | global batch size: 512 | lm loss: 2.094907E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.234 | TFLOPs: 57.43 | 7: iteration 17170/ 44073 | consumed samples: 8791040 | consumed tokens: 18004049920 | elapsed time per iteration (s): 4.39 | learning rate: 1.422E-04 | global batch size: 512 | lm loss: 2.072649E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.684 | TFLOPs: 54.38 | 7: iteration 17180/ 44073 | consumed samples: 8796160 | consumed tokens: 18014535680 | elapsed time per iteration (s): 4.15 | learning rate: 1.422E-04 | global batch size: 512 | lm loss: 2.049578E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.449 | TFLOPs: 57.53 | 7: iteration 17190/ 44073 | consumed samples: 8801280 | consumed tokens: 18025021440 | elapsed time per iteration (s): 4.15 | learning rate: 1.421E-04 | global batch size: 512 | lm loss: 2.091572E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.490 | TFLOPs: 57.55 | 7: iteration 17200/ 44073 | consumed samples: 8806400 | consumed tokens: 18035507200 | elapsed time per iteration (s): 4.16 | learning rate: 1.421E-04 | global batch size: 512 | lm loss: 2.071854E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.098 | TFLOPs: 57.37 | 7: iteration 17210/ 44073 | consumed samples: 8811520 | consumed tokens: 18045992960 | elapsed time per iteration (s): 4.14 | learning rate: 1.420E-04 | global batch size: 512 | lm loss: 2.082692E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.695 | TFLOPs: 57.65 | 7: 
iteration 17220/ 44073 | consumed samples: 8816640 | consumed tokens: 18056478720 | elapsed time per iteration (s): 4.15 | learning rate: 1.419E-04 | global batch size: 512 | lm loss: 2.083325E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.476 | TFLOPs: 57.55 | 7: iteration 17230/ 44073 | consumed samples: 8821760 | consumed tokens: 18066964480 | elapsed time per iteration (s): 4.36 | learning rate: 1.419E-04 | global batch size: 512 | lm loss: 2.063861E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.401 | TFLOPs: 54.71 | 7: iteration 17240/ 44073 | consumed samples: 8826880 | consumed tokens: 18077450240 | elapsed time per iteration (s): 4.16 | learning rate: 1.418E-04 | global batch size: 512 | lm loss: 2.045482E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.938 | TFLOPs: 57.30 | 7: iteration 17250/ 44073 | consumed samples: 8832000 | consumed tokens: 18087936000 | elapsed time per iteration (s): 4.15 | learning rate: 1.417E-04 | global batch size: 512 | lm loss: 2.048806E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.429 | TFLOPs: 57.52 | 7: iteration 17260/ 44073 | consumed samples: 8837120 | consumed tokens: 18098421760 | elapsed time per iteration (s): 4.15 | learning rate: 1.417E-04 | global batch size: 512 | lm loss: 2.049168E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.519 | TFLOPs: 57.57 | 7: iteration 17270/ 44073 | consumed samples: 8842240 | consumed tokens: 18108907520 | elapsed time per iteration (s): 4.17 | learning rate: 1.416E-04 | global batch size: 512 | lm loss: 2.065171E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.733 | TFLOPs: 57.20 | 7: iteration 17280/ 44073 | consumed samples: 8847360 | consumed tokens: 18119393280 | elapsed time per iteration (s): 4.16 | learning rate: 1.416E-04 | global batch size: 512 | lm loss: 2.066977E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.008 | TFLOPs: 57.33 | 7: iteration 17290/ 44073 | consumed samples: 8852480 | consumed tokens: 18129879040 | elapsed time per iteration (s): 4.19 | learning rate: 1.415E-04 | global batch size: 512 | lm loss: 2.078177E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.162 | TFLOPs: 56.93 | 7: iteration 17300/ 44073 | consumed samples: 8857600 | consumed tokens: 18140364800 | elapsed time per iteration (s): 4.16 | learning rate: 1.414E-04 | global batch size: 512 | lm loss: 2.050081E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.197 | TFLOPs: 57.42 | 7: iteration 17310/ 44073 | consumed samples: 8862720 | consumed tokens: 18150850560 | elapsed time per iteration (s): 4.17 | learning rate: 1.414E-04 | global batch size: 512 | lm loss: 2.036332E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.872 | TFLOPs: 57.26 | 7: iteration 17320/ 44073 | consumed samples: 8867840 | consumed tokens: 18161336320 | elapsed time per iteration (s): 4.15 | learning rate: 1.413E-04 | global batch size: 512 | lm loss: 2.066636E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.508 | TFLOPs: 57.56 | 7: iteration 17330/ 44073 | consumed samples: 8872960 | consumed tokens: 18171822080 | elapsed time per iteration (s): 4.18 | learning rate: 1.413E-04 | global batch 
size: 512 | lm loss: 2.070275E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.381 | TFLOPs: 57.04 | 7: iteration 17340/ 44073 | consumed samples: 8878080 | consumed tokens: 18182307840 | elapsed time per iteration (s): 4.17 | learning rate: 1.412E-04 | global batch size: 512 | lm loss: 2.071506E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.745 | TFLOPs: 57.21 | 7: iteration 17350/ 44073 | consumed samples: 8883200 | consumed tokens: 18192793600 | elapsed time per iteration (s): 4.21 | learning rate: 1.411E-04 | global batch size: 512 | lm loss: 2.065818E+00 | grad norm: 0.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.653 | TFLOPs: 56.70 | 7: iteration 17360/ 44073 | consumed samples: 8888320 | consumed tokens: 18203279360 | elapsed time per iteration (s): 4.18 | learning rate: 1.411E-04 | global batch size: 512 | lm loss: 2.081050E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.573 | TFLOPs: 57.13 | 7: iteration 17370/ 44073 | consumed samples: 8893440 | consumed tokens: 18213765120 | elapsed time per iteration (s): 4.18 | learning rate: 1.410E-04 | global batch size: 512 | lm loss: 2.083117E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.521 | TFLOPs: 57.10 | 7: iteration 17380/ 44073 | consumed samples: 8898560 | consumed tokens: 18224250880 | elapsed time per iteration (s): 4.15 | learning rate: 1.410E-04 | global batch size: 512 | lm loss: 2.068282E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.495 | TFLOPs: 57.55 | 7: iteration 17390/ 44073 | consumed samples: 8903680 | consumed tokens: 
18234736640 | elapsed time per iteration (s): 4.18 | learning rate: 1.409E-04 | global batch size: 512 | lm loss: 2.046012E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.559 | TFLOPs: 57.12 | 7: iteration 17400/ 44073 | consumed samples: 8908800 | consumed tokens: 18245222400 | elapsed time per iteration (s): 4.18 | learning rate: 1.408E-04 | global batch size: 512 | lm loss: 2.060892E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.587 | TFLOPs: 57.13 | 7: iteration 17410/ 44073 | consumed samples: 8913920 | consumed tokens: 18255708160 | elapsed time per iteration (s): 4.19 | learning rate: 1.408E-04 | global batch size: 512 | lm loss: 2.059426E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.128 | TFLOPs: 56.92 | 7: iteration 17420/ 44073 | consumed samples: 8919040 | consumed tokens: 18266193920 | elapsed time per iteration (s): 4.16 | learning rate: 1.407E-04 | global batch size: 512 | lm loss: 2.060678E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.952 | TFLOPs: 57.30 | 7: iteration 17430/ 44073 | consumed samples: 8924160 | consumed tokens: 18276679680 | elapsed time per iteration (s): 4.16 | learning rate: 1.407E-04 | global batch size: 512 | lm loss: 2.075521E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.181 | TFLOPs: 57.41 | 7: iteration 17440/ 44073 | consumed samples: 8929280 | consumed tokens: 18287165440 | elapsed time per iteration (s): 4.21 | learning rate: 1.406E-04 | global batch size: 512 | lm loss: 2.079466E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.687 | 
TFLOPs: 56.71 | 7: iteration 17450/ 44073 | consumed samples: 8934400 | consumed tokens: 18297651200 | elapsed time per iteration (s): 4.15 | learning rate: 1.405E-04 | global batch size: 512 | lm loss: 2.086389E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.374 | TFLOPs: 57.50 | 7: iteration 17460/ 44073 | consumed samples: 8939520 | consumed tokens: 18308136960 | elapsed time per iteration (s): 4.14 | learning rate: 1.405E-04 | global batch size: 512 | lm loss: 2.067679E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.580 | TFLOPs: 57.59 | 7: iteration 17470/ 44073 | consumed samples: 8944640 | consumed tokens: 18318622720 | elapsed time per iteration (s): 4.18 | learning rate: 1.404E-04 | global batch size: 512 | lm loss: 2.066700E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.507 | TFLOPs: 57.09 | 7: iteration 17480/ 44073 | consumed samples: 8949760 | consumed tokens: 18329108480 | elapsed time per iteration (s): 4.17 | learning rate: 1.403E-04 | global batch size: 512 | lm loss: 2.063995E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.709 | TFLOPs: 57.19 | 7: iteration 17490/ 44073 | consumed samples: 8954880 | consumed tokens: 18339594240 | elapsed time per iteration (s): 4.19 | learning rate: 1.403E-04 | global batch size: 512 | lm loss: 2.086735E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.288 | TFLOPs: 56.99 | 7: iteration 17500/ 44073 | consumed samples: 8960000 | consumed tokens: 18350080000 | elapsed time per iteration (s): 4.15 | learning rate: 1.402E-04 | global batch size: 512 | lm loss: 2.100480E+00 | grad norm: 0.132 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.266 | TFLOPs: 57.45 | 7: iteration 17510/ 44073 | consumed samples: 8965120 | consumed tokens: 18360565760 | elapsed time per iteration (s): 4.17 | learning rate: 1.402E-04 | global batch size: 512 | lm loss: 2.085161E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.738 | TFLOPs: 57.20 | 7: iteration 17520/ 44073 | consumed samples: 8970240 | consumed tokens: 18371051520 | elapsed time per iteration (s): 4.15 | learning rate: 1.401E-04 | global batch size: 512 | lm loss: 2.072819E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.262 | TFLOPs: 57.45 | 7: iteration 17530/ 44073 | consumed samples: 8975360 | consumed tokens: 18381537280 | elapsed time per iteration (s): 4.17 | learning rate: 1.400E-04 | global batch size: 512 | lm loss: 2.074770E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.775 | TFLOPs: 57.22 | 7: iteration 17540/ 44073 | consumed samples: 8980480 | consumed tokens: 18392023040 | elapsed time per iteration (s): 4.15 | learning rate: 1.400E-04 | global batch size: 512 | lm loss: 2.059834E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.441 | TFLOPs: 57.53 | 7: iteration 17550/ 44073 | consumed samples: 8985600 | consumed tokens: 18402508800 | elapsed time per iteration (s): 4.15 | learning rate: 1.399E-04 | global batch size: 512 | lm loss: 2.049100E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.304 | TFLOPs: 57.47 | 7: iteration 17560/ 44073 | consumed samples: 8990720 | consumed tokens: 18412994560 | elapsed time per iteration (s): 4.18 | learning rate: 
1.399E-04 | global batch size: 512 | lm loss: 2.054249E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.606 | TFLOPs: 57.14 | 7: iteration 17570/ 44073 | consumed samples: 8995840 | consumed tokens: 18423480320 | elapsed time per iteration (s): 4.20 | learning rate: 1.398E-04 | global batch size: 512 | lm loss: 2.072364E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.960 | TFLOPs: 56.84 | 7: iteration 17580/ 44073 | consumed samples: 9000960 | consumed tokens: 18433966080 | elapsed time per iteration (s): 4.18 | learning rate: 1.397E-04 | global batch size: 512 | lm loss: 2.062745E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.360 | TFLOPs: 57.03 | 7: iteration 17590/ 44073 | consumed samples: 9006080 | consumed tokens: 18444451840 | elapsed time per iteration (s): 4.14 | learning rate: 1.397E-04 | global batch size: 512 | lm loss: 2.051584E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.802 | TFLOPs: 57.70 | 7: iteration 17600/ 44073 | consumed samples: 9011200 | consumed tokens: 18454937600 | elapsed time per iteration (s): 4.14 | learning rate: 1.396E-04 | global batch size: 512 | lm loss: 2.075768E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.574 | TFLOPs: 57.59 | 7: iteration 17610/ 44073 | consumed samples: 9016320 | consumed tokens: 18465423360 | elapsed time per iteration (s): 4.14 | learning rate: 1.396E-04 | global batch size: 512 | lm loss: 2.067952E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 | 7: iteration 17620/ 44073 | consumed samples: 
9021440 | consumed tokens: 18475909120 | elapsed time per iteration (s): 4.20 | learning rate: 1.395E-04 | global batch size: 512 | lm loss: 2.060097E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.020 | TFLOPs: 56.87 | 7: iteration 17630/ 44073 | consumed samples: 9026560 | consumed tokens: 18486394880 | elapsed time per iteration (s): 4.20 | learning rate: 1.394E-04 | global batch size: 512 | lm loss: 2.054453E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.960 | TFLOPs: 56.84 | 7: iteration 17640/ 44073 | consumed samples: 9031680 | consumed tokens: 18496880640 | elapsed time per iteration (s): 4.22 | learning rate: 1.394E-04 | global batch size: 512 | lm loss: 2.071965E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.268 | TFLOPs: 56.52 | 7: iteration 17650/ 44073 | consumed samples: 9036800 | consumed tokens: 18507366400 | elapsed time per iteration (s): 4.16 | learning rate: 1.393E-04 | global batch size: 512 | lm loss: 2.062996E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.937 | TFLOPs: 57.29 | 7: iteration 17660/ 44073 | consumed samples: 9041920 | consumed tokens: 18517852160 | elapsed time per iteration (s): 4.15 | learning rate: 1.392E-04 | global batch size: 512 | lm loss: 2.071724E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 17670/ 44073 | consumed samples: 9047040 | consumed tokens: 18528337920 | elapsed time per iteration (s): 4.14 | learning rate: 1.392E-04 | global batch size: 512 | lm loss: 2.059361E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 123.692 | TFLOPs: 57.65 | 7: iteration 17680/ 44073 | consumed samples: 9052160 | consumed tokens: 18538823680 | elapsed time per iteration (s): 4.16 | learning rate: 1.391E-04 | global batch size: 512 | lm loss: 2.055113E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.016 | TFLOPs: 57.33 | 7: iteration 17690/ 44073 | consumed samples: 9057280 | consumed tokens: 18549309440 | elapsed time per iteration (s): 4.14 | learning rate: 1.391E-04 | global batch size: 512 | lm loss: 2.060067E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.701 | TFLOPs: 57.65 | 7: iteration 17700/ 44073 | consumed samples: 9062400 | consumed tokens: 18559795200 | elapsed time per iteration (s): 4.14 | learning rate: 1.390E-04 | global batch size: 512 | lm loss: 2.081857E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.571 | TFLOPs: 57.59 | 7: iteration 17710/ 44073 | consumed samples: 9067520 | consumed tokens: 18570280960 | elapsed time per iteration (s): 4.15 | learning rate: 1.389E-04 | global batch size: 512 | lm loss: 2.058886E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.262 | TFLOPs: 57.45 | 7: iteration 17720/ 44073 | consumed samples: 9072640 | consumed tokens: 18580766720 | elapsed time per iteration (s): 4.15 | learning rate: 1.389E-04 | global batch size: 512 | lm loss: 2.043548E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.510 | TFLOPs: 57.56 | 7: iteration 17730/ 44073 | consumed samples: 9077760 | consumed tokens: 18591252480 | elapsed time per iteration (s): 4.14 | learning rate: 1.388E-04 | global batch size: 512 | lm loss: 2.041107E+00 | grad norm: 
0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.816 | TFLOPs: 57.70 | 7: iteration 17740/ 44073 | consumed samples: 9082880 | consumed tokens: 18601738240 | elapsed time per iteration (s): 4.15 | learning rate: 1.388E-04 | global batch size: 512 | lm loss: 2.072240E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.503 | TFLOPs: 57.56 | 7: iteration 17750/ 44073 | consumed samples: 9088000 | consumed tokens: 18612224000 | elapsed time per iteration (s): 4.17 | learning rate: 1.387E-04 | global batch size: 512 | lm loss: 2.075051E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.830 | TFLOPs: 57.24 | 7: iteration 17760/ 44073 | consumed samples: 9093120 | consumed tokens: 18622709760 | elapsed time per iteration (s): 4.14 | learning rate: 1.386E-04 | global batch size: 512 | lm loss: 2.062194E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.535 | TFLOPs: 57.57 | 7: iteration 17770/ 44073 | consumed samples: 9098240 | consumed tokens: 18633195520 | elapsed time per iteration (s): 4.15 | learning rate: 1.386E-04 | global batch size: 512 | lm loss: 2.051233E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.302 | TFLOPs: 57.47 | 7: iteration 17780/ 44073 | consumed samples: 9103360 | consumed tokens: 18643681280 | elapsed time per iteration (s): 4.16 | learning rate: 1.385E-04 | global batch size: 512 | lm loss: 2.075379E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.057 | TFLOPs: 57.35 | 7: iteration 17790/ 44073 | consumed samples: 9108480 | consumed tokens: 18654167040 | elapsed time per iteration (s): 4.17 
| learning rate: 1.385E-04 | global batch size: 512 | lm loss: 2.070362E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.902 | TFLOPs: 57.28 | 7: iteration 17800/ 44073 | consumed samples: 9113600 | consumed tokens: 18664652800 | elapsed time per iteration (s): 4.23 | learning rate: 1.384E-04 | global batch size: 512 | lm loss: 2.043705E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.168 | TFLOPs: 56.47 | 7: iteration 17810/ 44073 | consumed samples: 9118720 | consumed tokens: 18675138560 | elapsed time per iteration (s): 4.17 | learning rate: 1.383E-04 | global batch size: 512 | lm loss: 2.060272E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.730 | TFLOPs: 57.20 | 7: iteration 17820/ 44073 | consumed samples: 9123840 | consumed tokens: 18685624320 | elapsed time per iteration (s): 4.46 | learning rate: 1.383E-04 | global batch size: 512 | lm loss: 2.061380E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.725 | TFLOPs: 53.47 | 7: iteration 17830/ 44073 | consumed samples: 9128960 | consumed tokens: 18696110080 | elapsed time per iteration (s): 4.15 | learning rate: 1.382E-04 | global batch size: 512 | lm loss: 2.062506E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.358 | TFLOPs: 57.49 | 7: iteration 17840/ 44073 | consumed samples: 9134080 | consumed tokens: 18706595840 | elapsed time per iteration (s): 4.17 | learning rate: 1.381E-04 | global batch size: 512 | lm loss: 2.066390E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.786 | TFLOPs: 57.22 | 7: iteration 17850/ 44073 | 
consumed samples: 9139200 | consumed tokens: 18717081600 | elapsed time per iteration (s): 4.14 | learning rate: 1.381E-04 | global batch size: 512 | lm loss: 2.039178E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 57.61 | 7: iteration 17860/ 44073 | consumed samples: 9144320 | consumed tokens: 18727567360 | elapsed time per iteration (s): 4.14 | learning rate: 1.380E-04 | global batch size: 512 | lm loss: 2.051437E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.754 | TFLOPs: 57.68 | 7: iteration 17870/ 44073 | consumed samples: 9149440 | consumed tokens: 18738053120 | elapsed time per iteration (s): 4.25 | learning rate: 1.380E-04 | global batch size: 512 | lm loss: 2.075984E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.443 | TFLOPs: 56.13 | 7: iteration 17880/ 44073 | consumed samples: 9154560 | consumed tokens: 18748538880 | elapsed time per iteration (s): 4.15 | learning rate: 1.379E-04 | global batch size: 512 | lm loss: 2.059655E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.244 | TFLOPs: 57.44 | 7: iteration 17890/ 44073 | consumed samples: 9159680 | consumed tokens: 18759024640 | elapsed time per iteration (s): 4.14 | learning rate: 1.378E-04 | global batch size: 512 | lm loss: 2.045815E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.780 | TFLOPs: 57.69 | 7: iteration 17900/ 44073 | consumed samples: 9164800 | consumed tokens: 18769510400 | elapsed time per iteration (s): 4.14 | learning rate: 1.378E-04 | global batch size: 512 | lm loss: 2.051617E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.632 | TFLOPs: 57.62 | 7: iteration 17910/ 44073 | consumed samples: 9169920 | consumed tokens: 18779996160 | elapsed time per iteration (s): 4.17 | learning rate: 1.377E-04 | global batch size: 512 | lm loss: 2.059059E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.849 | TFLOPs: 57.25 | 7: iteration 17920/ 44073 | consumed samples: 9175040 | consumed tokens: 18790481920 | elapsed time per iteration (s): 4.16 | learning rate: 1.377E-04 | global batch size: 512 | lm loss: 2.032176E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.126 | TFLOPs: 57.38 | 7: iteration 17930/ 44073 | consumed samples: 9180160 | consumed tokens: 18800967680 | elapsed time per iteration (s): 4.13 | learning rate: 1.376E-04 | global batch size: 512 | lm loss: 2.063897E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.859 | TFLOPs: 57.72 | 7: iteration 17940/ 44073 | consumed samples: 9185280 | consumed tokens: 18811453440 | elapsed time per iteration (s): 4.13 | learning rate: 1.375E-04 | global batch size: 512 | lm loss: 2.070608E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.908 | TFLOPs: 57.75 | 7: iteration 17950/ 44073 | consumed samples: 9190400 | consumed tokens: 18821939200 | elapsed time per iteration (s): 4.17 | learning rate: 1.375E-04 | global batch size: 512 | lm loss: 2.056781E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.905 | TFLOPs: 57.28 | 7: iteration 17960/ 44073 | consumed samples: 9195520 | consumed tokens: 18832424960 | elapsed time per iteration (s): 4.14 | learning rate: 1.374E-04 | global batch size: 512 | lm loss: 
2.060600E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.804 | TFLOPs: 57.70 | 7: iteration 17970/ 44073 | consumed samples: 9200640 | consumed tokens: 18842910720 | elapsed time per iteration (s): 4.13 | learning rate: 1.373E-04 | global batch size: 512 | lm loss: 2.043696E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.874 | TFLOPs: 57.73 | 7: iteration 17980/ 44073 | consumed samples: 9205760 | consumed tokens: 18853396480 | elapsed time per iteration (s): 4.16 | learning rate: 1.373E-04 | global batch size: 512 | lm loss: 2.075584E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.199 | TFLOPs: 57.42 | 7: iteration 17990/ 44073 | consumed samples: 9210880 | consumed tokens: 18863882240 | elapsed time per iteration (s): 4.14 | learning rate: 1.372E-04 | global batch size: 512 | lm loss: 2.052670E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.717 | TFLOPs: 57.66 | 0: [2022-11-26 07:27:45,332] [INFO] [logging.py:68:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.00013715677597501234, 0.00013715677597501234, 0.00013715677597501234], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 18000/ 44073 | consumed samples: 9216000 | consumed tokens: 18874368000 | elapsed time per iteration (s): 4.14 | learning rate: 1.372E-04 | global batch size: 512 | lm loss: 2.058842E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.760 | TFLOPs: 57.68 | 0: steps: 18000 loss: 2.0757 iter time (s): 4.174 samples/sec: 122.678 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 18000 | lm loss value: 2.055872E+00 
| lm loss PPL: 7.813648E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 18000 to checkpoints_2b2 0: [2022-11-26 07:27:46,669] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step18000 is begin to save! 0: [2022-11-26 07:27:46,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:27:47,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:27:47,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:27:47,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:27:47,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:27:47,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:27:47,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:27:47,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:27:47,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:27:47,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 07:27:47,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:27:47,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_07-model_00-model_states.pt. 0: [2022-11-26 07:27:47,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:27:47,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:27:47,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:27:47,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:27:47,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:27:48,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:27:48,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:27:48,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:27:48,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:27:48,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 07:27:48,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:27:48,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_13-model_00-model_states.pt. 0: [2022-11-26 07:27:48,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:27:48,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:27:48,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:27:48,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:27:48,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:27:48,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:27:48,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:27:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:27:48,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:27:49,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 07:27:49,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:27:49,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_19-model_00-model_states.pt. 0: [2022-11-26 07:27:49,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:27:49,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:27:49,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:27:49,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:27:49,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:27:49,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:27:49,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:27:49,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:27:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:27:49,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 07:27:49,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:27:49,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_25-model_00-model_states.pt. 0: [2022-11-26 07:27:49,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:27:50,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:27:50,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:27:50,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:27:50,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:27:50,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:27:50,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_29-model_00-model_states.pt... 0: [2022-11-26 07:27:50,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_29-model_00-model_states.pt. 0: [2022-11-26 07:27:50,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:27:50,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 07:27:50,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_31-model_00-model_states.pt... 0: [2022-11-26 07:27:50,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_31-model_00-model_states.pt. 0: [2022-11-26 07:27:50,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_32-model_00-model_states.pt... 0: [2022-11-26 07:27:50,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_32-model_00-model_states.pt. 0: [2022-11-26 07:27:50,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_33-model_00-model_states.pt... 0: [2022-11-26 07:27:50,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_33-model_00-model_states.pt. 0: [2022-11-26 07:27:50,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_34-model_00-model_states.pt... 0: [2022-11-26 07:27:51,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_34-model_00-model_states.pt. 0: [2022-11-26 07:27:51,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/layer_36-model_00-model_states.pt... 0: [2022-11-26 07:27:51,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/layer_36-model_00-model_states.pt. 0: [2022-11-26 07:27:51,104] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step18000/mp_rank_00_model_states.pt 0: [2022-11-26 07:27:51,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/mp_rank_00_model_states.pt... 
0: [2022-11-26 07:27:51,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:27:51,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:27:51,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:27:51,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:27:51,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:27:51,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
4: [2022-11-26 07:27:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
3: [2022-11-26 07:27:51,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
2: [2022-11-26 07:27:51,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:27:51,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:27:51,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:27:51,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:27:51,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:51,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:27:51,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:51,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:27:51,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 07:27:51,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:27:51,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 07:27:51,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:27:51,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 07:27:51,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 07:27:51,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:27:51,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:27:51,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:27:52,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:27:52,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 07:27:52,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:27:52,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:27:52,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:27:52,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 07:27:52,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 07:27:52,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 07:27:52,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:27:52,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:27:52,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 07:27:52,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:27:52,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
0: successfully saved checkpoint at iteration 18000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5727.26 7: iteration 18010/ 44073 | consumed samples: 9221120 | consumed tokens: 18884853760 | elapsed time per iteration (s): 4.89 | learning rate: 1.371E-04 | global batch size: 512 | lm loss: 2.049681E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.741 | TFLOPs: 48.81 | 7: iteration 18020/ 44073 | consumed samples: 9226240 | consumed tokens: 18895339520 | elapsed time per iteration (s): 4.14 | learning rate: 1.370E-04 | global batch size: 512 | lm loss: 2.044558E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.771 | TFLOPs: 57.68 | 7: iteration 18030/ 44073 | consumed samples: 9231360 | consumed tokens: 18905825280 | elapsed time per iteration (s): 4.18 | learning rate: 1.370E-04 | global batch size: 512 | lm loss: 2.072787E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.615 | TFLOPs: 57.14 | 7: iteration 18040/ 44073 | consumed samples: 9236480 | consumed tokens: 18916311040 | elapsed time per iteration (s): 4.17 | learning rate: 1.369E-04 | global batch size: 512 | lm loss: 2.054808E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.829 | TFLOPs: 57.24 | 7: iteration 18050/ 44073 | consumed samples: 9241600 | consumed tokens: 18926796800 | elapsed time per iteration (s): 4.15 | learning rate: 1.368E-04 | global batch size: 512 | lm loss: 2.064273E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.295 | TFLOPs: 57.46 | 7: iteration 18060/ 44073 | consumed samples: 9246720 | consumed tokens: 18937282560 | elapsed time per iteration (s): 4.22 | learning rate: 
1.368E-04 | global batch size: 512 | lm loss: 2.059023E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.319 | TFLOPs: 56.54 | 7: iteration 18070/ 44073 | consumed samples: 9251840 | consumed tokens: 18947768320 | elapsed time per iteration (s): 4.17 | learning rate: 1.367E-04 | global batch size: 512 | lm loss: 2.037278E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.882 | TFLOPs: 57.27 | 7: iteration 18080/ 44073 | consumed samples: 9256960 | consumed tokens: 18958254080 | elapsed time per iteration (s): 4.14 | learning rate: 1.367E-04 | global batch size: 512 | lm loss: 2.085754E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.781 | TFLOPs: 57.69 | 7: iteration 18090/ 44073 | consumed samples: 9262080 | consumed tokens: 18968739840 | elapsed time per iteration (s): 4.15 | learning rate: 1.366E-04 | global batch size: 512 | lm loss: 2.064939E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 18100/ 44073 | consumed samples: 9267200 | consumed tokens: 18979225600 | elapsed time per iteration (s): 4.28 | learning rate: 1.365E-04 | global batch size: 512 | lm loss: 2.065208E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.664 | TFLOPs: 55.77 | 7: iteration 18110/ 44073 | consumed samples: 9272320 | consumed tokens: 18989711360 | elapsed time per iteration (s): 4.16 | learning rate: 1.365E-04 | global batch size: 512 | lm loss: 2.063237E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.211 | TFLOPs: 57.42 | 7: iteration 18120/ 44073 | consumed samples: 
9277440 | consumed tokens: 19000197120 | elapsed time per iteration (s): 4.20 | learning rate: 1.364E-04 | global batch size: 512 | lm loss: 2.071869E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.862 | TFLOPs: 56.79 | 7: iteration 18130/ 44073 | consumed samples: 9282560 | consumed tokens: 19010682880 | elapsed time per iteration (s): 4.19 | learning rate: 1.364E-04 | global batch size: 512 | lm loss: 2.041297E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.251 | TFLOPs: 56.98 | 7: iteration 18140/ 44073 | consumed samples: 9287680 | consumed tokens: 19021168640 | elapsed time per iteration (s): 4.14 | learning rate: 1.363E-04 | global batch size: 512 | lm loss: 2.060842E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.680 | TFLOPs: 57.64 | 7: iteration 18150/ 44073 | consumed samples: 9292800 | consumed tokens: 19031654400 | elapsed time per iteration (s): 4.16 | learning rate: 1.362E-04 | global batch size: 512 | lm loss: 2.073730E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.211 | TFLOPs: 57.42 | 7: iteration 18160/ 44073 | consumed samples: 9297920 | consumed tokens: 19042140160 | elapsed time per iteration (s): 4.30 | learning rate: 1.362E-04 | global batch size: 512 | lm loss: 2.056730E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.205 | TFLOPs: 55.56 | 7: iteration 18170/ 44073 | consumed samples: 9303040 | consumed tokens: 19052625920 | elapsed time per iteration (s): 4.21 | learning rate: 1.361E-04 | global batch size: 512 | lm loss: 2.046653E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 121.543 | TFLOPs: 56.65 | 7: iteration 18180/ 44073 | consumed samples: 9308160 | consumed tokens: 19063111680 | elapsed time per iteration (s): 4.15 | learning rate: 1.360E-04 | global batch size: 512 | lm loss: 2.049689E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.291 | TFLOPs: 57.46 | 7: iteration 18190/ 44073 | consumed samples: 9313280 | consumed tokens: 19073597440 | elapsed time per iteration (s): 4.14 | learning rate: 1.360E-04 | global batch size: 512 | lm loss: 2.063396E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.587 | TFLOPs: 57.60 | 7: iteration 18200/ 44073 | consumed samples: 9318400 | consumed tokens: 19084083200 | elapsed time per iteration (s): 4.16 | learning rate: 1.359E-04 | global batch size: 512 | lm loss: 2.061522E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.972 | TFLOPs: 57.31 | 7: iteration 18210/ 44073 | consumed samples: 9323520 | consumed tokens: 19094568960 | elapsed time per iteration (s): 4.14 | learning rate: 1.359E-04 | global batch size: 512 | lm loss: 2.043436E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.776 | TFLOPs: 57.69 | 7: iteration 18220/ 44073 | consumed samples: 9328640 | consumed tokens: 19105054720 | elapsed time per iteration (s): 4.17 | learning rate: 1.358E-04 | global batch size: 512 | lm loss: 2.049555E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.697 | TFLOPs: 57.18 | 7: iteration 18230/ 44073 | consumed samples: 9333760 | consumed tokens: 19115540480 | elapsed time per iteration (s): 4.14 | learning rate: 1.357E-04 | global batch size: 512 | lm loss: 2.062953E+00 | grad norm: 
0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.593 | TFLOPs: 57.60 | 7: iteration 18240/ 44073 | consumed samples: 9338880 | consumed tokens: 19126026240 | elapsed time per iteration (s): 4.17 | learning rate: 1.357E-04 | global batch size: 512 | lm loss: 2.052654E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.819 | TFLOPs: 57.24 | 7: iteration 18250/ 44073 | consumed samples: 9344000 | consumed tokens: 19136512000 | elapsed time per iteration (s): 4.14 | learning rate: 1.356E-04 | global batch size: 512 | lm loss: 2.053437E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.776 | TFLOPs: 57.69 | 7: iteration 18260/ 44073 | consumed samples: 9349120 | consumed tokens: 19146997760 | elapsed time per iteration (s): 4.18 | learning rate: 1.355E-04 | global batch size: 512 | lm loss: 2.045684E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.575 | TFLOPs: 57.13 | 7: iteration 18270/ 44073 | consumed samples: 9354240 | consumed tokens: 19157483520 | elapsed time per iteration (s): 4.21 | learning rate: 1.355E-04 | global batch size: 512 | lm loss: 2.058392E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.537 | TFLOPs: 56.64 | 7: iteration 18280/ 44073 | consumed samples: 9359360 | consumed tokens: 19167969280 | elapsed time per iteration (s): 4.20 | learning rate: 1.354E-04 | global batch size: 512 | lm loss: 2.058848E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.818 | TFLOPs: 56.77 | 7: iteration 18290/ 44073 | consumed samples: 9364480 | consumed tokens: 19178455040 | elapsed time per iteration (s): 4.16 
| learning rate: 1.354E-04 | global batch size: 512 | lm loss: 2.035673E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.064 | TFLOPs: 57.35 | 7: iteration 18300/ 44073 | consumed samples: 9369600 | consumed tokens: 19188940800 | elapsed time per iteration (s): 4.19 | learning rate: 1.353E-04 | global batch size: 512 | lm loss: 2.028804E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.225 | TFLOPs: 56.96 | 7: iteration 18310/ 44073 | consumed samples: 9374720 | consumed tokens: 19199426560 | elapsed time per iteration (s): 4.14 | learning rate: 1.352E-04 | global batch size: 512 | lm loss: 2.071270E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.653 | TFLOPs: 57.63 | 7: iteration 18320/ 44073 | consumed samples: 9379840 | consumed tokens: 19209912320 | elapsed time per iteration (s): 4.20 | learning rate: 1.352E-04 | global batch size: 512 | lm loss: 2.034492E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.878 | TFLOPs: 56.80 | 7: iteration 18330/ 44073 | consumed samples: 9384960 | consumed tokens: 19220398080 | elapsed time per iteration (s): 4.16 | learning rate: 1.351E-04 | global batch size: 512 | lm loss: 2.043907E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.068 | TFLOPs: 57.36 | 7: iteration 18340/ 44073 | consumed samples: 9390080 | consumed tokens: 19230883840 | elapsed time per iteration (s): 4.16 | learning rate: 1.350E-04 | global batch size: 512 | lm loss: 2.062342E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.947 | TFLOPs: 57.30 | 7: iteration 18350/ 44073 | 
consumed samples: 9395200 | consumed tokens: 19241369600 | elapsed time per iteration (s): 4.20 | learning rate: 1.350E-04 | global batch size: 512 | lm loss: 2.050022E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.860 | TFLOPs: 56.79 | 7: iteration 18360/ 44073 | consumed samples: 9400320 | consumed tokens: 19251855360 | elapsed time per iteration (s): 4.17 | learning rate: 1.349E-04 | global batch size: 512 | lm loss: 2.040531E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.896 | TFLOPs: 57.28 | 7: iteration 18370/ 44073 | consumed samples: 9405440 | consumed tokens: 19262341120 | elapsed time per iteration (s): 4.14 | learning rate: 1.349E-04 | global batch size: 512 | lm loss: 2.065427E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.648 | TFLOPs: 57.63 | 7: iteration 18380/ 44073 | consumed samples: 9410560 | consumed tokens: 19272826880 | elapsed time per iteration (s): 4.13 | learning rate: 1.348E-04 | global batch size: 512 | lm loss: 2.048327E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.887 | TFLOPs: 57.74 | 7: iteration 18390/ 44073 | consumed samples: 9415680 | consumed tokens: 19283312640 | elapsed time per iteration (s): 4.14 | learning rate: 1.347E-04 | global batch size: 512 | lm loss: 2.060707E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.558 | TFLOPs: 57.58 | 7: iteration 18400/ 44073 | consumed samples: 9420800 | consumed tokens: 19293798400 | elapsed time per iteration (s): 4.15 | learning rate: 1.347E-04 | global batch size: 512 | lm loss: 2.050178E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 123.404 | TFLOPs: 57.51 | 7: iteration 18410/ 44073 | consumed samples: 9425920 | consumed tokens: 19304284160 | elapsed time per iteration (s): 4.17 | learning rate: 1.346E-04 | global batch size: 512 | lm loss: 2.059094E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.794 | TFLOPs: 57.23 | 7: iteration 18420/ 44073 | consumed samples: 9431040 | consumed tokens: 19314769920 | elapsed time per iteration (s): 5.55 | learning rate: 1.345E-04 | global batch size: 512 | lm loss: 2.051884E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.290 | TFLOPs: 43.01 | 7: iteration 18430/ 44073 | consumed samples: 9436160 | consumed tokens: 19325255680 | elapsed time per iteration (s): 4.17 | learning rate: 1.345E-04 | global batch size: 512 | lm loss: 2.068199E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.732 | TFLOPs: 57.20 | 7: iteration 18440/ 44073 | consumed samples: 9441280 | consumed tokens: 19335741440 | elapsed time per iteration (s): 4.16 | learning rate: 1.344E-04 | global batch size: 512 | lm loss: 2.026363E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.200 | TFLOPs: 57.42 | 7: iteration 18450/ 44073 | consumed samples: 9446400 | consumed tokens: 19346227200 | elapsed time per iteration (s): 4.14 | learning rate: 1.344E-04 | global batch size: 512 | lm loss: 2.059113E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.813 | TFLOPs: 57.70 | 7: iteration 18460/ 44073 | consumed samples: 9451520 | consumed tokens: 19356712960 | elapsed time per iteration (s): 4.14 | learning rate: 1.343E-04 | global batch size: 512 | lm loss: 
2.065023E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.708 | TFLOPs: 57.65 | 7: iteration 18470/ 44073 | consumed samples: 9456640 | consumed tokens: 19367198720 | elapsed time per iteration (s): 4.22 | learning rate: 1.342E-04 | global batch size: 512 | lm loss: 2.076707E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.423 | TFLOPs: 56.59 | 7: iteration 18480/ 44073 | consumed samples: 9461760 | consumed tokens: 19377684480 | elapsed time per iteration (s): 4.18 | learning rate: 1.342E-04 | global batch size: 512 | lm loss: 2.044702E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.529 | TFLOPs: 57.10 | 7: iteration 18490/ 44073 | consumed samples: 9466880 | consumed tokens: 19388170240 | elapsed time per iteration (s): 4.19 | learning rate: 1.341E-04 | global batch size: 512 | lm loss: 2.068556E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.193 | TFLOPs: 56.95 | 7: iteration 18500/ 44073 | consumed samples: 9472000 | consumed tokens: 19398656000 | elapsed time per iteration (s): 4.15 | learning rate: 1.341E-04 | global batch size: 512 | lm loss: 2.042187E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.385 | TFLOPs: 57.50 | 7: iteration 18510/ 44073 | consumed samples: 9477120 | consumed tokens: 19409141760 | elapsed time per iteration (s): 4.15 | learning rate: 1.340E-04 | global batch size: 512 | lm loss: 2.049460E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.323 | TFLOPs: 57.47 | 7: iteration 18520/ 44073 | consumed samples: 9482240 | consumed tokens: 19419627520 | elapsed 
time per iteration (s): 4.15 | learning rate: 1.339E-04 | global batch size: 512 | lm loss: 2.055923E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.225 | TFLOPs: 57.43 | 7: iteration 18530/ 44073 | consumed samples: 9487360 | consumed tokens: 19430113280 | elapsed time per iteration (s): 4.18 | learning rate: 1.339E-04 | global batch size: 512 | lm loss: 2.055570E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.539 | TFLOPs: 57.11 | 7: iteration 18540/ 44073 | consumed samples: 9492480 | consumed tokens: 19440599040 | elapsed time per iteration (s): 4.16 | learning rate: 1.338E-04 | global batch size: 512 | lm loss: 2.047104E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.026 | TFLOPs: 57.34 | 7: iteration 18550/ 44073 | consumed samples: 9497600 | consumed tokens: 19451084800 | elapsed time per iteration (s): 4.18 | learning rate: 1.337E-04 | global batch size: 512 | lm loss: 2.054601E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.345 | TFLOPs: 57.02 | 7: iteration 18560/ 44073 | consumed samples: 9502720 | consumed tokens: 19461570560 | elapsed time per iteration (s): 4.19 | learning rate: 1.337E-04 | global batch size: 512 | lm loss: 2.049958E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.282 | TFLOPs: 56.99 | 7: iteration 18570/ 44073 | consumed samples: 9507840 | consumed tokens: 19472056320 | elapsed time per iteration (s): 4.21 | learning rate: 1.336E-04 | global batch size: 512 | lm loss: 2.062633E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.701 | TFLOPs: 56.72 | 7: 
iteration 18580/ 44073 | consumed samples: 9512960 | consumed tokens: 19482542080 | elapsed time per iteration (s): 4.16 | learning rate: 1.336E-04 | global batch size: 512 | lm loss: 2.059371E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.008 | TFLOPs: 57.33 | 7: iteration 18590/ 44073 | consumed samples: 9518080 | consumed tokens: 19493027840 | elapsed time per iteration (s): 4.14 | learning rate: 1.335E-04 | global batch size: 512 | lm loss: 2.041039E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.589 | TFLOPs: 57.60 | 7: iteration 18600/ 44073 | consumed samples: 9523200 | consumed tokens: 19503513600 | elapsed time per iteration (s): 4.13 | learning rate: 1.334E-04 | global batch size: 512 | lm loss: 2.042851E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.848 | TFLOPs: 57.72 | 7: iteration 18610/ 44073 | consumed samples: 9528320 | consumed tokens: 19513999360 | elapsed time per iteration (s): 4.16 | learning rate: 1.334E-04 | global batch size: 512 | lm loss: 2.053907E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.164 | TFLOPs: 57.40 | 7: iteration 18620/ 44073 | consumed samples: 9533440 | consumed tokens: 19524485120 | elapsed time per iteration (s): 4.17 | learning rate: 1.333E-04 | global batch size: 512 | lm loss: 2.044233E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.923 | TFLOPs: 57.29 | 7: iteration 18630/ 44073 | consumed samples: 9538560 | consumed tokens: 19534970880 | elapsed time per iteration (s): 4.15 | learning rate: 1.332E-04 | global batch size: 512 | lm loss: 2.048047E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.448 | TFLOPs: 57.53 | 7: iteration 18640/ 44073 | consumed samples: 9543680 | consumed tokens: 19545456640 | elapsed time per iteration (s): 4.13 | learning rate: 1.332E-04 | global batch size: 512 | lm loss: 2.066306E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.895 | TFLOPs: 57.74 | 7: iteration 18650/ 44073 | consumed samples: 9548800 | consumed tokens: 19555942400 | elapsed time per iteration (s): 4.17 | learning rate: 1.331E-04 | global batch size: 512 | lm loss: 2.049170E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.778 | TFLOPs: 57.22 | 7: iteration 18660/ 44073 | consumed samples: 9553920 | consumed tokens: 19566428160 | elapsed time per iteration (s): 4.14 | learning rate: 1.331E-04 | global batch size: 512 | lm loss: 2.050136E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.616 | TFLOPs: 57.61 | 7: iteration 18670/ 44073 | consumed samples: 9559040 | consumed tokens: 19576913920 | elapsed time per iteration (s): 4.19 | learning rate: 1.330E-04 | global batch size: 512 | lm loss: 2.043649E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.230 | TFLOPs: 56.97 | 7: iteration 18680/ 44073 | consumed samples: 9564160 | consumed tokens: 19587399680 | elapsed time per iteration (s): 4.15 | learning rate: 1.329E-04 | global batch size: 512 | lm loss: 2.057414E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.500 | TFLOPs: 57.56 | 7: iteration 18690/ 44073 | consumed samples: 9569280 | consumed tokens: 19597885440 | elapsed time per iteration (s): 4.16 | learning rate: 1.329E-04 | global batch 
size: 512 | lm loss: 2.062753E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.003 | TFLOPs: 57.33 | 7: iteration 18700/ 44073 | consumed samples: 9574400 | consumed tokens: 19608371200 | elapsed time per iteration (s): 4.13 | learning rate: 1.328E-04 | global batch size: 512 | lm loss: 2.038845E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.915 | TFLOPs: 57.75 | 7: iteration 18710/ 44073 | consumed samples: 9579520 | consumed tokens: 19618856960 | elapsed time per iteration (s): 4.17 | learning rate: 1.327E-04 | global batch size: 512 | lm loss: 2.024236E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.783 | TFLOPs: 57.22 | 7: iteration 18720/ 44073 | consumed samples: 9584640 | consumed tokens: 19629342720 | elapsed time per iteration (s): 4.15 | learning rate: 1.327E-04 | global batch size: 512 | lm loss: 2.037892E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.508 | TFLOPs: 57.56 | 7: iteration 18730/ 44073 | consumed samples: 9589760 | consumed tokens: 19639828480 | elapsed time per iteration (s): 4.16 | learning rate: 1.326E-04 | global batch size: 512 | lm loss: 2.057119E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.115 | TFLOPs: 57.38 | 7: iteration 18740/ 44073 | consumed samples: 9594880 | consumed tokens: 19650314240 | elapsed time per iteration (s): 4.19 | learning rate: 1.325E-04 | global batch size: 512 | lm loss: 2.059244E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.205 | TFLOPs: 56.95 | 7: iteration 18750/ 44073 | consumed samples: 9600000 | consumed tokens: 
19660800000 | elapsed time per iteration (s): 4.17 | learning rate: 1.325E-04 | global batch size: 512 | lm loss: 2.059909E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.695 | TFLOPs: 57.18 | 7: iteration 18760/ 44073 | consumed samples: 9605120 | consumed tokens: 19671285760 | elapsed time per iteration (s): 4.24 | learning rate: 1.324E-04 | global batch size: 512 | lm loss: 2.039931E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.878 | TFLOPs: 56.34 | 7: iteration 18770/ 44073 | consumed samples: 9610240 | consumed tokens: 19681771520 | elapsed time per iteration (s): 4.21 | learning rate: 1.324E-04 | global batch size: 512 | lm loss: 2.018953E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.667 | TFLOPs: 56.70 | 7: iteration 18780/ 44073 | consumed samples: 9615360 | consumed tokens: 19692257280 | elapsed time per iteration (s): 4.17 | learning rate: 1.323E-04 | global batch size: 512 | lm loss: 2.024643E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.745 | TFLOPs: 57.21 | 7: iteration 18790/ 44073 | consumed samples: 9620480 | consumed tokens: 19702743040 | elapsed time per iteration (s): 4.18 | learning rate: 1.322E-04 | global batch size: 512 | lm loss: 2.061236E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.569 | TFLOPs: 57.12 | 7: iteration 18800/ 44073 | consumed samples: 9625600 | consumed tokens: 19713228800 | elapsed time per iteration (s): 4.29 | learning rate: 1.322E-04 | global batch size: 512 | lm loss: 2.050528E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.250 | 
TFLOPs: 55.58 | 7: iteration 18810/ 44073 | consumed samples: 9630720 | consumed tokens: 19723714560 | elapsed time per iteration (s): 4.20 | learning rate: 1.321E-04 | global batch size: 512 | lm loss: 2.059032E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.997 | TFLOPs: 56.86 | 7: iteration 18820/ 44073 | consumed samples: 9635840 | consumed tokens: 19734200320 | elapsed time per iteration (s): 4.14 | learning rate: 1.320E-04 | global batch size: 512 | lm loss: 2.036967E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.677 | TFLOPs: 57.64 | 7: iteration 18830/ 44073 | consumed samples: 9640960 | consumed tokens: 19744686080 | elapsed time per iteration (s): 4.19 | learning rate: 1.320E-04 | global batch size: 512 | lm loss: 2.039172E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.269 | TFLOPs: 56.98 | 7: iteration 18840/ 44073 | consumed samples: 9646080 | consumed tokens: 19755171840 | elapsed time per iteration (s): 4.20 | learning rate: 1.319E-04 | global batch size: 512 | lm loss: 2.036958E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.879 | TFLOPs: 56.80 | 7: iteration 18850/ 44073 | consumed samples: 9651200 | consumed tokens: 19765657600 | elapsed time per iteration (s): 4.15 | learning rate: 1.319E-04 | global batch size: 512 | lm loss: 2.051097E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.506 | TFLOPs: 57.56 | 7: iteration 18860/ 44073 | consumed samples: 9656320 | consumed tokens: 19776143360 | elapsed time per iteration (s): 4.19 | learning rate: 1.318E-04 | global batch size: 512 | lm loss: 2.078069E+00 | grad norm: 0.139 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.166 | TFLOPs: 56.94 | 7: iteration 18870/ 44073 | consumed samples: 9661440 | consumed tokens: 19786629120 | elapsed time per iteration (s): 4.15 | learning rate: 1.317E-04 | global batch size: 512 | lm loss: 2.053271E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.274 | TFLOPs: 57.45 | 7: iteration 18880/ 44073 | consumed samples: 9666560 | consumed tokens: 19797114880 | elapsed time per iteration (s): 4.16 | learning rate: 1.317E-04 | global batch size: 512 | lm loss: 2.044154E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.930 | TFLOPs: 57.29 | 7: iteration 18890/ 44073 | consumed samples: 9671680 | consumed tokens: 19807600640 | elapsed time per iteration (s): 4.19 | learning rate: 1.316E-04 | global batch size: 512 | lm loss: 2.044813E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.175 | TFLOPs: 56.94 | 7: iteration 18900/ 44073 | consumed samples: 9676800 | consumed tokens: 19818086400 | elapsed time per iteration (s): 4.15 | learning rate: 1.315E-04 | global batch size: 512 | lm loss: 2.035613E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.264 | TFLOPs: 57.45 | 7: iteration 18910/ 44073 | consumed samples: 9681920 | consumed tokens: 19828572160 | elapsed time per iteration (s): 4.15 | learning rate: 1.315E-04 | global batch size: 512 | lm loss: 2.023757E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.290 | TFLOPs: 57.46 | 7: iteration 18920/ 44073 | consumed samples: 9687040 | consumed tokens: 19839057920 | elapsed time per iteration (s): 4.14 | learning rate: 
1.314E-04 | global batch size: 512 | lm loss: 2.048005E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 18930/ 44073 | consumed samples: 9692160 | consumed tokens: 19849543680 | elapsed time per iteration (s): 4.14 | learning rate: 1.314E-04 | global batch size: 512 | lm loss: 2.051738E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.739 | TFLOPs: 57.67 | 7: iteration 18940/ 44073 | consumed samples: 9697280 | consumed tokens: 19860029440 | elapsed time per iteration (s): 4.20 | learning rate: 1.313E-04 | global batch size: 512 | lm loss: 2.051523E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.996 | TFLOPs: 56.86 | 7: iteration 18950/ 44073 | consumed samples: 9702400 | consumed tokens: 19870515200 | elapsed time per iteration (s): 4.23 | learning rate: 1.312E-04 | global batch size: 512 | lm loss: 2.043511E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.117 | TFLOPs: 56.45 | 7: iteration 18960/ 44073 | consumed samples: 9707520 | consumed tokens: 19881000960 | elapsed time per iteration (s): 4.14 | learning rate: 1.312E-04 | global batch size: 512 | lm loss: 2.055216E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.738 | TFLOPs: 57.67 | 7: iteration 18970/ 44073 | consumed samples: 9712640 | consumed tokens: 19891486720 | elapsed time per iteration (s): 4.26 | learning rate: 1.311E-04 | global batch size: 512 | lm loss: 2.039305E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.127 | TFLOPs: 55.99 | 7: iteration 18980/ 44073 | consumed samples: 
9717760 | consumed tokens: 19901972480 | elapsed time per iteration (s): 4.20 | learning rate: 1.310E-04 | global batch size: 512 | lm loss: 2.059310E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.983 | TFLOPs: 56.85 |
7: iteration 18990/ 44073 | consumed samples: 9722880 | consumed tokens: 19912458240 | elapsed time per iteration (s): 4.15 | learning rate: 1.310E-04 | global batch size: 512 | lm loss: 2.041065E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.277 | TFLOPs: 57.45 |
7: iteration 19000/ 44073 | consumed samples: 9728000 | consumed tokens: 19922944000 | elapsed time per iteration (s): 4.17 | learning rate: 1.309E-04 | global batch size: 512 | lm loss: 2.050219E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.805 | TFLOPs: 57.23 |
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 19000 | lm loss value: 2.057433E+00 | lm loss PPL: 7.825854E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 19000 to checkpoints_2b2
0: [2022-11-26 08:37:39,344] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step19000 is begin to save!
0: [2022-11-26 08:37:39,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_01-model_00-model_states.pt...
0: [2022-11-26 08:37:39,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_01-model_00-model_states.pt.
0: [2022-11-26 08:37:39,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_03-model_00-model_states.pt...
0: [2022-11-26 08:37:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:37:39,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:37:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:37:39,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:37:40,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:37:40,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:37:40,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:37:40,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:37:40,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:37:40,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:37:40,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:37:40,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 08:37:40,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:37:40,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:37:40,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:37:40,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:37:40,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:37:40,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:37:41,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:37:41,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:37:41,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:37:41,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:37:41,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:37:41,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 08:37:41,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:37:41,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:37:41,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:37:41,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:37:41,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:37:41,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:37:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:37:41,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:37:41,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:37:41,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:37:42,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:37:42,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 08:37:42,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:37:42,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:37:42,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:37:42,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:37:42,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:37:42,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:37:42,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:37:42,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:37:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:37:42,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:37:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:37:42,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 08:37:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:37:43,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:37:43,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:37:43,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_29-model_00-model_states.pt... 0: [2022-11-26 08:37:43,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_29-model_00-model_states.pt. 0: [2022-11-26 08:37:43,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:37:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:37:43,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_31-model_00-model_states.pt... 0: [2022-11-26 08:37:43,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_31-model_00-model_states.pt. 0: [2022-11-26 08:37:43,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_32-model_00-model_states.pt... 0: [2022-11-26 08:37:43,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_32-model_00-model_states.pt. 0: [2022-11-26 08:37:43,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 08:37:43,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_33-model_00-model_states.pt. 0: [2022-11-26 08:37:43,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_34-model_00-model_states.pt... 0: [2022-11-26 08:37:43,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_34-model_00-model_states.pt. 0: [2022-11-26 08:37:43,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/layer_36-model_00-model_states.pt... 0: [2022-11-26 08:37:43,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/layer_36-model_00-model_states.pt. 0: [2022-11-26 08:37:43,977] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step19000/mp_rank_00_model_states.pt 0: [2022-11-26 08:37:43,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:37:43,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/mp_rank_00_model_states.pt. 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:37:44,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:37:44,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:37:44,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:37:44,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 08:37:44,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 08:37:44,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:37:44,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 08:37:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
2: [2022-11-26 08:37:44,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 08:37:44,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 08:37:44,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
0: [2022-11-26 08:37:44,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 08:37:44,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:37:44,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 08:37:44,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:37:44,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:37:44,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
2: [2022-11-26 08:37:44,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:37:44,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:37:44,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 08:37:44,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:37:44,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:37:44,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 08:37:44,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:37:44,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:37:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:37:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:37:44,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
0: [2022-11-26 08:37:44,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:37:44,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:37:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 08:37:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 08:37:45,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: successfully saved checkpoint at iteration 19000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5999.28 7: iteration 19010/ 44073 | consumed samples: 9733120 | consumed tokens: 19933429760 | elapsed time per iteration (s): 4.90 | learning rate: 1.309E-04 | global batch size: 512 | lm loss: 2.025126E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.546 | TFLOPs: 48.72 | 7: iteration 19020/ 44073 | consumed samples: 9738240 | consumed tokens: 19943915520 | elapsed time per iteration (s): 4.18 | learning rate: 1.308E-04 | global batch size: 512 | lm loss: 2.031323E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.361 | TFLOPs: 57.03 | 7: iteration 19030/ 44073 | consumed samples: 9743360 | consumed tokens: 19954401280 | elapsed time per iteration (s): 4.37 | learning rate: 1.307E-04 | global batch size: 512 | lm loss: 2.049410E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.158 | TFLOPs: 54.60 | 7: iteration 19040/ 44073 | consumed samples: 9748480 | consumed tokens: 19964887040 | elapsed time per iteration (s): 4.20 | learning rate: 1.307E-04 | global batch size: 512 | lm loss: 2.032739E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.033 | TFLOPs: 
56.87 | 7: iteration 19050/ 44073 | consumed samples: 9753600 | consumed tokens: 19975372800 | elapsed time per iteration (s): 4.23 | learning rate: 1.306E-04 | global batch size: 512 | lm loss: 2.034439E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.976 | TFLOPs: 56.38 | 7: iteration 19060/ 44073 | consumed samples: 9758720 | consumed tokens: 19985858560 | elapsed time per iteration (s): 4.20 | learning rate: 1.305E-04 | global batch size: 512 | lm loss: 2.053539E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.959 | TFLOPs: 56.84 | 7: iteration 19070/ 44073 | consumed samples: 9763840 | consumed tokens: 19996344320 | elapsed time per iteration (s): 4.15 | learning rate: 1.305E-04 | global batch size: 512 | lm loss: 2.064045E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.415 | TFLOPs: 57.52 | 7: iteration 19080/ 44073 | consumed samples: 9768960 | consumed tokens: 20006830080 | elapsed time per iteration (s): 4.14 | learning rate: 1.304E-04 | global batch size: 512 | lm loss: 2.052129E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.544 | TFLOPs: 57.58 | 7: iteration 19090/ 44073 | consumed samples: 9774080 | consumed tokens: 20017315840 | elapsed time per iteration (s): 4.27 | learning rate: 1.303E-04 | global batch size: 512 | lm loss: 2.065095E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.991 | TFLOPs: 55.92 | 7: iteration 19100/ 44073 | consumed samples: 9779200 | consumed tokens: 20027801600 | elapsed time per iteration (s): 4.27 | learning rate: 1.303E-04 | global batch size: 512 | lm loss: 2.051080E+00 | grad norm: 0.123 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.813 | TFLOPs: 55.84 | 7: iteration 19110/ 44073 | consumed samples: 9784320 | consumed tokens: 20038287360 | elapsed time per iteration (s): 4.17 | learning rate: 1.302E-04 | global batch size: 512 | lm loss: 2.051731E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.693 | TFLOPs: 57.18 | 7: iteration 19120/ 44073 | consumed samples: 9789440 | consumed tokens: 20048773120 | elapsed time per iteration (s): 4.14 | learning rate: 1.302E-04 | global batch size: 512 | lm loss: 2.066214E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.668 | TFLOPs: 57.64 | 7: iteration 19130/ 44073 | consumed samples: 9794560 | consumed tokens: 20059258880 | elapsed time per iteration (s): 4.18 | learning rate: 1.301E-04 | global batch size: 512 | lm loss: 2.042474E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.509 | TFLOPs: 57.10 | 7: iteration 19140/ 44073 | consumed samples: 9799680 | consumed tokens: 20069744640 | elapsed time per iteration (s): 4.15 | learning rate: 1.300E-04 | global batch size: 512 | lm loss: 2.047617E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.307 | TFLOPs: 57.47 | 7: iteration 19150/ 44073 | consumed samples: 9804800 | consumed tokens: 20080230400 | elapsed time per iteration (s): 4.15 | learning rate: 1.300E-04 | global batch size: 512 | lm loss: 2.043954E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.241 | TFLOPs: 57.44 | 7: iteration 19160/ 44073 | consumed samples: 9809920 | consumed tokens: 20090716160 | elapsed time per iteration (s): 4.19 | learning rate: 1.299E-04 | global 
batch size: 512 | lm loss: 2.045915E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.327 | TFLOPs: 57.01 | 7: iteration 19170/ 44073 | consumed samples: 9815040 | consumed tokens: 20101201920 | elapsed time per iteration (s): 4.26 | learning rate: 1.298E-04 | global batch size: 512 | lm loss: 2.023158E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.275 | TFLOPs: 56.05 | 7: iteration 19180/ 44073 | consumed samples: 9820160 | consumed tokens: 20111687680 | elapsed time per iteration (s): 4.14 | learning rate: 1.298E-04 | global batch size: 512 | lm loss: 2.048440E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.645 | TFLOPs: 57.62 | 7: iteration 19190/ 44073 | consumed samples: 9825280 | consumed tokens: 20122173440 | elapsed time per iteration (s): 4.15 | learning rate: 1.297E-04 | global batch size: 512 | lm loss: 2.039893E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.285 | TFLOPs: 57.46 | 7: iteration 19200/ 44073 | consumed samples: 9830400 | consumed tokens: 20132659200 | elapsed time per iteration (s): 4.18 | learning rate: 1.297E-04 | global batch size: 512 | lm loss: 2.051119E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.365 | TFLOPs: 57.03 | 7: iteration 19210/ 44073 | consumed samples: 9835520 | consumed tokens: 20143144960 | elapsed time per iteration (s): 4.17 | learning rate: 1.296E-04 | global batch size: 512 | lm loss: 2.053452E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.911 | TFLOPs: 57.28 | 7: iteration 19220/ 44073 | consumed samples: 9840640 | consumed 
tokens: 20153630720 | elapsed time per iteration (s): 4.21 | learning rate: 1.295E-04 | global batch size: 512 | lm loss: 2.047298E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.639 | TFLOPs: 56.69 | 7: iteration 19230/ 44073 | consumed samples: 9845760 | consumed tokens: 20164116480 | elapsed time per iteration (s): 4.20 | learning rate: 1.295E-04 | global batch size: 512 | lm loss: 2.041190E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.782 | TFLOPs: 56.76 | 7: iteration 19240/ 44073 | consumed samples: 9850880 | consumed tokens: 20174602240 | elapsed time per iteration (s): 4.19 | learning rate: 1.294E-04 | global batch size: 512 | lm loss: 2.034217E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.088 | TFLOPs: 56.90 | 7: iteration 19250/ 44073 | consumed samples: 9856000 | consumed tokens: 20185088000 | elapsed time per iteration (s): 4.19 | learning rate: 1.293E-04 | global batch size: 512 | lm loss: 2.047963E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.233 | TFLOPs: 56.97 | 7: iteration 19260/ 44073 | consumed samples: 9861120 | consumed tokens: 20195573760 | elapsed time per iteration (s): 4.19 | learning rate: 1.293E-04 | global batch size: 512 | lm loss: 2.055765E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.127 | TFLOPs: 56.92 | 7: iteration 19270/ 44073 | consumed samples: 9866240 | consumed tokens: 20206059520 | elapsed time per iteration (s): 4.16 | learning rate: 1.292E-04 | global batch size: 512 | lm loss: 2.048215E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
122.954 | TFLOPs: 57.30 | 7: iteration 19280/ 44073 | consumed samples: 9871360 | consumed tokens: 20216545280 | elapsed time per iteration (s): 4.17 | learning rate: 1.291E-04 | global batch size: 512 | lm loss: 2.037322E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.815 | TFLOPs: 57.24 | 7: iteration 19290/ 44073 | consumed samples: 9876480 | consumed tokens: 20227031040 | elapsed time per iteration (s): 4.17 | learning rate: 1.291E-04 | global batch size: 512 | lm loss: 2.031912E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.670 | TFLOPs: 57.17 | 7: iteration 19300/ 44073 | consumed samples: 9881600 | consumed tokens: 20237516800 | elapsed time per iteration (s): 4.18 | learning rate: 1.290E-04 | global batch size: 512 | lm loss: 2.048472E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.480 | TFLOPs: 57.08 | 7: iteration 19310/ 44073 | consumed samples: 9886720 | consumed tokens: 20248002560 | elapsed time per iteration (s): 4.20 | learning rate: 1.290E-04 | global batch size: 512 | lm loss: 2.041002E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.924 | TFLOPs: 56.82 | 7: iteration 19320/ 44073 | consumed samples: 9891840 | consumed tokens: 20258488320 | elapsed time per iteration (s): 4.20 | learning rate: 1.289E-04 | global batch size: 512 | lm loss: 2.041824E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.807 | TFLOPs: 56.77 | 7: iteration 19330/ 44073 | consumed samples: 9896960 | consumed tokens: 20268974080 | elapsed time per iteration (s): 4.21 | learning rate: 1.288E-04 | global batch size: 512 | lm loss: 2.038448E+00 | grad norm: 0.121 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.584 | TFLOPs: 56.66 | 7: iteration 19340/ 44073 | consumed samples: 9902080 | consumed tokens: 20279459840 | elapsed time per iteration (s): 4.23 | learning rate: 1.288E-04 | global batch size: 512 | lm loss: 2.049303E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.989 | TFLOPs: 56.39 | 7: iteration 19350/ 44073 | consumed samples: 9907200 | consumed tokens: 20289945600 | elapsed time per iteration (s): 4.15 | learning rate: 1.287E-04 | global batch size: 512 | lm loss: 2.045187E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.338 | TFLOPs: 57.48 | 7: iteration 19360/ 44073 | consumed samples: 9912320 | consumed tokens: 20300431360 | elapsed time per iteration (s): 4.24 | learning rate: 1.286E-04 | global batch size: 512 | lm loss: 2.061904E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.651 | TFLOPs: 56.23 | 7: iteration 19370/ 44073 | consumed samples: 9917440 | consumed tokens: 20310917120 | elapsed time per iteration (s): 4.22 | learning rate: 1.286E-04 | global batch size: 512 | lm loss: 2.018366E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.193 | TFLOPs: 56.48 | 7: iteration 19380/ 44073 | consumed samples: 9922560 | consumed tokens: 20321402880 | elapsed time per iteration (s): 4.18 | learning rate: 1.285E-04 | global batch size: 512 | lm loss: 2.029347E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.356 | TFLOPs: 57.02 | 7: iteration 19390/ 44073 | consumed samples: 9927680 | consumed tokens: 20331888640 | elapsed time per iteration (s): 4.18 | learning rate: 
1.284E-04 | global batch size: 512 | lm loss: 2.026830E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.419 | TFLOPs: 57.05 | 7: iteration 19400/ 44073 | consumed samples: 9932800 | consumed tokens: 20342374400 | elapsed time per iteration (s): 4.17 | learning rate: 1.284E-04 | global batch size: 512 | lm loss: 2.041685E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.757 | TFLOPs: 57.21 | 7: iteration 19410/ 44073 | consumed samples: 9937920 | consumed tokens: 20352860160 | elapsed time per iteration (s): 4.20 | learning rate: 1.283E-04 | global batch size: 512 | lm loss: 2.041748E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.835 | TFLOPs: 56.78 | 7: iteration 19420/ 44073 | consumed samples: 9943040 | consumed tokens: 20363345920 | elapsed time per iteration (s): 4.20 | learning rate: 1.283E-04 | global batch size: 512 | lm loss: 2.042929E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.005 | TFLOPs: 56.86 | 7: iteration 19430/ 44073 | consumed samples: 9948160 | consumed tokens: 20373831680 | elapsed time per iteration (s): 4.25 | learning rate: 1.282E-04 | global batch size: 512 | lm loss: 2.028389E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.507 | TFLOPs: 56.16 | 7: iteration 19440/ 44073 | consumed samples: 9953280 | consumed tokens: 20384317440 | elapsed time per iteration (s): 4.18 | learning rate: 1.281E-04 | global batch size: 512 | lm loss: 2.059458E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.606 | TFLOPs: 57.14 | 7: iteration 19450/ 44073 | consumed samples: 
9958400 | consumed tokens: 20394803200 | elapsed time per iteration (s): 4.17 | learning rate: 1.281E-04 | global batch size: 512 | lm loss: 2.038061E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.883 | TFLOPs: 57.27 | 7: iteration 19460/ 44073 | consumed samples: 9963520 | consumed tokens: 20405288960 | elapsed time per iteration (s): 5.77 | learning rate: 1.280E-04 | global batch size: 512 | lm loss: 2.043346E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 88.686 | TFLOPs: 41.33 | 7: iteration 19470/ 44073 | consumed samples: 9968640 | consumed tokens: 20415774720 | elapsed time per iteration (s): 4.17 | learning rate: 1.279E-04 | global batch size: 512 | lm loss: 2.040653E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.810 | TFLOPs: 57.24 | 7: iteration 19480/ 44073 | consumed samples: 9973760 | consumed tokens: 20426260480 | elapsed time per iteration (s): 4.16 | learning rate: 1.279E-04 | global batch size: 512 | lm loss: 2.041336E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.186 | TFLOPs: 57.41 | 7: iteration 19490/ 44073 | consumed samples: 9978880 | consumed tokens: 20436746240 | elapsed time per iteration (s): 4.19 | learning rate: 1.278E-04 | global batch size: 512 | lm loss: 2.043922E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.321 | TFLOPs: 57.01 | 7: iteration 19500/ 44073 | consumed samples: 9984000 | consumed tokens: 20447232000 | elapsed time per iteration (s): 4.13 | learning rate: 1.277E-04 | global batch size: 512 | lm loss: 2.036398E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 123.827 | TFLOPs: 57.71 | 7: iteration 19510/ 44073 | consumed samples: 9989120 | consumed tokens: 20457717760 | elapsed time per iteration (s): 4.15 | learning rate: 1.277E-04 | global batch size: 512 | lm loss: 2.073637E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.308 | TFLOPs: 57.47 | 7: iteration 19520/ 44073 | consumed samples: 9994240 | consumed tokens: 20468203520 | elapsed time per iteration (s): 4.13 | learning rate: 1.276E-04 | global batch size: 512 | lm loss: 2.031320E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.866 | TFLOPs: 57.73 | 7: iteration 19530/ 44073 | consumed samples: 9999360 | consumed tokens: 20478689280 | elapsed time per iteration (s): 4.16 | learning rate: 1.276E-04 | global batch size: 512 | lm loss: 2.038241E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.107 | TFLOPs: 57.37 | 7: iteration 19540/ 44073 | consumed samples: 10004480 | consumed tokens: 20489175040 | elapsed time per iteration (s): 4.27 | learning rate: 1.275E-04 | global batch size: 512 | lm loss: 2.033424E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.936 | TFLOPs: 55.90 | 7: iteration 19550/ 44073 | consumed samples: 10009600 | consumed tokens: 20499660800 | elapsed time per iteration (s): 4.14 | learning rate: 1.274E-04 | global batch size: 512 | lm loss: 2.039722E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.704 | TFLOPs: 57.65 | 7: iteration 19560/ 44073 | consumed samples: 10014720 | consumed tokens: 20510146560 | elapsed time per iteration (s): 4.14 | learning rate: 1.274E-04 | global batch size: 512 | lm loss: 2.037222E+00 | grad 
norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.661 | TFLOPs: 57.63 | 7: iteration 19570/ 44073 | consumed samples: 10019840 | consumed tokens: 20520632320 | elapsed time per iteration (s): 4.15 | learning rate: 1.273E-04 | global batch size: 512 | lm loss: 2.037933E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.447 | TFLOPs: 57.53 | 7: iteration 19580/ 44073 | consumed samples: 10024960 | consumed tokens: 20531118080 | elapsed time per iteration (s): 4.13 | learning rate: 1.272E-04 | global batch size: 512 | lm loss: 2.034282E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.908 | TFLOPs: 57.75 | 7: iteration 19590/ 44073 | consumed samples: 10030080 | consumed tokens: 20541603840 | elapsed time per iteration (s): 4.13 | learning rate: 1.272E-04 | global batch size: 512 | lm loss: 2.049246E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.879 | TFLOPs: 57.73 | 7: iteration 19600/ 44073 | consumed samples: 10035200 | consumed tokens: 20552089600 | elapsed time per iteration (s): 4.14 | learning rate: 1.271E-04 | global batch size: 512 | lm loss: 2.028840E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.785 | TFLOPs: 57.69 | 7: iteration 19610/ 44073 | consumed samples: 10040320 | consumed tokens: 20562575360 | elapsed time per iteration (s): 4.15 | learning rate: 1.271E-04 | global batch size: 512 | lm loss: 2.033594E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.437 | TFLOPs: 57.53 | 7: iteration 19620/ 44073 | consumed samples: 10045440 | consumed tokens: 20573061120 | elapsed time per 
iteration (s): 4.14 | learning rate: 1.270E-04 | global batch size: 512 | lm loss: 2.025076E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.731 | TFLOPs: 57.67 | 7: iteration 19630/ 44073 | consumed samples: 10050560 | consumed tokens: 20583546880 | elapsed time per iteration (s): 4.15 | learning rate: 1.269E-04 | global batch size: 512 | lm loss: 2.015616E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.488 | TFLOPs: 57.55 | 7: iteration 19640/ 44073 | consumed samples: 10055680 | consumed tokens: 20594032640 | elapsed time per iteration (s): 4.15 | learning rate: 1.269E-04 | global batch size: 512 | lm loss: 2.039052E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.236 | TFLOPs: 57.43 | 7: iteration 19650/ 44073 | consumed samples: 10060800 | consumed tokens: 20604518400 | elapsed time per iteration (s): 4.14 | learning rate: 1.268E-04 | global batch size: 512 | lm loss: 2.050446E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.535 | TFLOPs: 57.57 | 7: iteration 19660/ 44073 | consumed samples: 10065920 | consumed tokens: 20615004160 | elapsed time per iteration (s): 4.16 | learning rate: 1.267E-04 | global batch size: 512 | lm loss: 2.043885E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.192 | TFLOPs: 57.41 | 7: iteration 19670/ 44073 | consumed samples: 10071040 | consumed tokens: 20625489920 | elapsed time per iteration (s): 4.15 | learning rate: 1.267E-04 | global batch size: 512 | lm loss: 2.044536E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.401 | TFLOPs: 57.51 | 7: 
iteration 19680/ 44073 | consumed samples: 10076160 | consumed tokens: 20635975680 | elapsed time per iteration (s): 4.16 | learning rate: 1.266E-04 | global batch size: 512 | lm loss: 2.036310E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.948 | TFLOPs: 57.30 | 7: iteration 19690/ 44073 | consumed samples: 10081280 | consumed tokens: 20646461440 | elapsed time per iteration (s): 4.15 | learning rate: 1.265E-04 | global batch size: 512 | lm loss: 2.033955E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.501 | TFLOPs: 57.56 | 7: iteration 19700/ 44073 | consumed samples: 10086400 | consumed tokens: 20656947200 | elapsed time per iteration (s): 4.14 | learning rate: 1.265E-04 | global batch size: 512 | lm loss: 2.041266E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.561 | TFLOPs: 57.59 | 7: iteration 19710/ 44073 | consumed samples: 10091520 | consumed tokens: 20667432960 | elapsed time per iteration (s): 4.13 | learning rate: 1.264E-04 | global batch size: 512 | lm loss: 2.014787E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.873 | TFLOPs: 57.73 | 7: iteration 19720/ 44073 | consumed samples: 10096640 | consumed tokens: 20677918720 | elapsed time per iteration (s): 4.15 | learning rate: 1.263E-04 | global batch size: 512 | lm loss: 2.045131E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.229 | TFLOPs: 57.43 | 7: iteration 19730/ 44073 | consumed samples: 10101760 | consumed tokens: 20688404480 | elapsed time per iteration (s): 4.16 | learning rate: 1.263E-04 | global batch size: 512 | lm loss: 2.056635E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.952 | TFLOPs: 57.30 | 7: iteration 19740/ 44073 | consumed samples: 10106880 | consumed tokens: 20698890240 | elapsed time per iteration (s): 4.13 | learning rate: 1.262E-04 | global batch size: 512 | lm loss: 2.045168E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.914 | TFLOPs: 57.75 | 7: iteration 19750/ 44073 | consumed samples: 10112000 | consumed tokens: 20709376000 | elapsed time per iteration (s): 4.14 | learning rate: 1.262E-04 | global batch size: 512 | lm loss: 2.054549E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.715 | TFLOPs: 57.66 | 7: iteration 19760/ 44073 | consumed samples: 10117120 | consumed tokens: 20719861760 | elapsed time per iteration (s): 4.13 | learning rate: 1.261E-04 | global batch size: 512 | lm loss: 2.025558E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.875 | TFLOPs: 57.73 | 7: iteration 19770/ 44073 | consumed samples: 10122240 | consumed tokens: 20730347520 | elapsed time per iteration (s): 4.14 | learning rate: 1.260E-04 | global batch size: 512 | lm loss: 2.034163E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.724 | TFLOPs: 57.66 | 7: iteration 19780/ 44073 | consumed samples: 10127360 | consumed tokens: 20740833280 | elapsed time per iteration (s): 4.14 | learning rate: 1.260E-04 | global batch size: 512 | lm loss: 2.021260E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.599 | TFLOPs: 57.60 | 7: iteration 19790/ 44073 | consumed samples: 10132480 | consumed tokens: 20751319040 | elapsed time per iteration (s): 4.15 | learning rate: 1.259E-04 | global 
batch size: 512 | lm loss: 2.037808E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.408 | TFLOPs: 57.51 | 7: iteration 19800/ 44073 | consumed samples: 10137600 | consumed tokens: 20761804800 | elapsed time per iteration (s): 4.15 | learning rate: 1.258E-04 | global batch size: 512 | lm loss: 2.034810E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.441 | TFLOPs: 57.53 | 7: iteration 19810/ 44073 | consumed samples: 10142720 | consumed tokens: 20772290560 | elapsed time per iteration (s): 4.16 | learning rate: 1.258E-04 | global batch size: 512 | lm loss: 2.053690E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.042 | TFLOPs: 57.34 | 7: iteration 19820/ 44073 | consumed samples: 10147840 | consumed tokens: 20782776320 | elapsed time per iteration (s): 4.16 | learning rate: 1.257E-04 | global batch size: 512 | lm loss: 2.041891E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.021 | TFLOPs: 57.33 | 7: iteration 19830/ 44073 | consumed samples: 10152960 | consumed tokens: 20793262080 | elapsed time per iteration (s): 4.14 | learning rate: 1.256E-04 | global batch size: 512 | lm loss: 2.049796E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.751 | TFLOPs: 57.67 | 7: iteration 19840/ 44073 | consumed samples: 10158080 | consumed tokens: 20803747840 | elapsed time per iteration (s): 4.17 | learning rate: 1.256E-04 | global batch size: 512 | lm loss: 2.023129E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.908 | TFLOPs: 57.28 | 7: iteration 19850/ 44073 | consumed samples: 10163200 | consumed 
tokens: 20814233600 | elapsed time per iteration (s): 4.19 | learning rate: 1.255E-04 | global batch size: 512 | lm loss: 2.043207E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.320 | TFLOPs: 57.01 | 7: iteration 19860/ 44073 | consumed samples: 10168320 | consumed tokens: 20824719360 | elapsed time per iteration (s): 4.20 | learning rate: 1.255E-04 | global batch size: 512 | lm loss: 2.073208E+00 | grad norm: 0.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.813 | TFLOPs: 56.77 | 7: iteration 19870/ 44073 | consumed samples: 10173440 | consumed tokens: 20835205120 | elapsed time per iteration (s): 4.20 | learning rate: 1.254E-04 | global batch size: 512 | lm loss: 2.057338E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.821 | TFLOPs: 56.77 | 7: iteration 19880/ 44073 | consumed samples: 10178560 | consumed tokens: 20845690880 | elapsed time per iteration (s): 4.18 | learning rate: 1.253E-04 | global batch size: 512 | lm loss: 2.036457E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.379 | TFLOPs: 57.04 | 7: iteration 19890/ 44073 | consumed samples: 10183680 | consumed tokens: 20856176640 | elapsed time per iteration (s): 4.14 | learning rate: 1.253E-04 | global batch size: 512 | lm loss: 2.040960E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.528 | TFLOPs: 57.57 | 7: iteration 19900/ 44073 | consumed samples: 10188800 | consumed tokens: 20866662400 | elapsed time per iteration (s): 4.26 | learning rate: 1.252E-04 | global batch size: 512 | lm loss: 2.046071E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 120.205 | TFLOPs: 56.02 | 7: iteration 19910/ 44073 | consumed samples: 10193920 | consumed tokens: 20877148160 | elapsed time per iteration (s): 4.14 | learning rate: 1.251E-04 | global batch size: 512 | lm loss: 2.035531E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 19920/ 44073 | consumed samples: 10199040 | consumed tokens: 20887633920 | elapsed time per iteration (s): 4.14 | learning rate: 1.251E-04 | global batch size: 512 | lm loss: 2.047626E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.657 | TFLOPs: 57.63 | 7: iteration 19930/ 44073 | consumed samples: 10204160 | consumed tokens: 20898119680 | elapsed time per iteration (s): 4.15 | learning rate: 1.250E-04 | global batch size: 512 | lm loss: 2.036721E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.326 | TFLOPs: 57.48 | 7: iteration 19940/ 44073 | consumed samples: 10209280 | consumed tokens: 20908605440 | elapsed time per iteration (s): 4.22 | learning rate: 1.249E-04 | global batch size: 512 | lm loss: 2.041448E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.281 | TFLOPs: 56.52 | 7: iteration 19950/ 44073 | consumed samples: 10214400 | consumed tokens: 20919091200 | elapsed time per iteration (s): 4.14 | learning rate: 1.249E-04 | global batch size: 512 | lm loss: 2.009694E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.803 | TFLOPs: 57.70 | 7: iteration 19960/ 44073 | consumed samples: 10219520 | consumed tokens: 20929576960 | elapsed time per iteration (s): 4.41 | learning rate: 1.248E-04 | global batch size: 512 | lm loss: 2.037553E+00 | grad norm: 0.127 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.154 | TFLOPs: 54.13 | 7: iteration 19970/ 44073 | consumed samples: 10224640 | consumed tokens: 20940062720 | elapsed time per iteration (s): 4.18 | learning rate: 1.248E-04 | global batch size: 512 | lm loss: 2.034253E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.531 | TFLOPs: 57.11 | 7: iteration 19980/ 44073 | consumed samples: 10229760 | consumed tokens: 20950548480 | elapsed time per iteration (s): 4.15 | learning rate: 1.247E-04 | global batch size: 512 | lm loss: 2.029914E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.225 | TFLOPs: 57.43 | 7: iteration 19990/ 44073 | consumed samples: 10234880 | consumed tokens: 20961034240 | elapsed time per iteration (s): 4.17 | learning rate: 1.246E-04 | global batch size: 512 | lm loss: 2.018545E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.671 | TFLOPs: 57.17 | 0: [2022-11-26 09:47:38,007] [INFO] [logging.py:68:log_dist] [Rank 0] step=20000, skipped=0, lr=[0.00012456232677539634, 0.00012456232677539634, 0.00012456232677539634], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 20000/ 44073 | consumed samples: 10240000 | consumed tokens: 20971520000 | elapsed time per iteration (s): 4.14 | learning rate: 1.246E-04 | global batch size: 512 | lm loss: 2.046767E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.799 | TFLOPs: 57.70 | 0: steps: 20000 loss: 2.0177 iter time (s): 4.186 samples/sec: 122.311 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 20000 | lm loss value: 1.977081E+00 | lm loss PPL: 7.221634E+00 
| 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 20000 to checkpoints_2b2 0: [2022-11-26 09:47:39,523] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step20000 is begin to save! 0: [2022-11-26 09:47:39,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_01-model_00-model_states.pt... 0: [2022-11-26 09:47:39,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:47:39,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:47:39,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:47:39,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:47:40,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:47:40,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:47:40,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 09:47:40,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:47:40,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_07-model_00-model_states.pt. 0: [2022-11-26 09:47:40,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_08-model_00-model_states.pt... 0: [2022-11-26 09:47:40,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:47:40,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:47:40,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:47:40,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:47:40,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:47:40,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:47:41,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:47:41,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:47:41,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 09:47:41,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:47:41,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_13-model_00-model_states.pt. 0: [2022-11-26 09:47:41,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_14-model_00-model_states.pt... 0: [2022-11-26 09:47:41,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:47:41,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:47:41,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:47:41,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:47:41,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:47:41,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:47:41,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:47:41,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:47:42,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 09:47:42,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:47:42,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_19-model_00-model_states.pt. 0: [2022-11-26 09:47:42,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_20-model_00-model_states.pt... 0: [2022-11-26 09:47:42,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:47:42,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:47:42,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:47:42,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:47:42,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:47:42,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:47:42,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:47:42,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:47:42,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 09:47:42,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:47:43,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_25-model_00-model_states.pt. 0: [2022-11-26 09:47:43,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_26-model_00-model_states.pt... 0: [2022-11-26 09:47:43,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:47:43,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:47:43,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:47:43,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:47:43,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:47:43,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_29-model_00-model_states.pt... 0: [2022-11-26 09:47:43,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_29-model_00-model_states.pt. 0: [2022-11-26 09:47:43,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:47:43,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 09:47:43,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_31-model_00-model_states.pt... 0: [2022-11-26 09:47:43,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_31-model_00-model_states.pt. 0: [2022-11-26 09:47:43,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_32-model_00-model_states.pt... 0: [2022-11-26 09:47:44,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_32-model_00-model_states.pt. 0: [2022-11-26 09:47:44,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_33-model_00-model_states.pt... 0: [2022-11-26 09:47:44,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_33-model_00-model_states.pt. 0: [2022-11-26 09:47:44,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_34-model_00-model_states.pt... 0: [2022-11-26 09:47:44,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_34-model_00-model_states.pt. 0: [2022-11-26 09:47:44,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/layer_36-model_00-model_states.pt... 0: [2022-11-26 09:47:44,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/layer_36-model_00-model_states.pt. 0: [2022-11-26 09:47:44,281] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step20000/mp_rank_00_model_states.pt 0: [2022-11-26 09:47:44,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/mp_rank_00_model_states.pt... 
0: [2022-11-26 09:47:44,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:44,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:47:44,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 09:47:44,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
0: [2022-11-26 09:47:44,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 09:47:44,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
1: [2022-11-26 09:47:44,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 09:47:44,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:44,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:44,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 09:47:44,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 09:47:44,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
1: [2022-11-26 09:47:44,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 09:47:44,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:44,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:44,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 09:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 09:47:44,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:47:44,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt
2: [2022-11-26 09:47:44,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
2: [2022-11-26 09:47:45,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-26 09:47:45,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-26 09:47:45,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
2: [2022-11-26 09:47:45,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt.
2: [2022-11-26 09:47:45,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt
2: [2022-11-26 09:47:45,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
2: [2022-11-26 09:47:45,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
2: [2022-11-26 09:47:45,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
2: [2022-11-26 09:47:45,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
6: [2022-11-26 09:47:45,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-26 09:47:45,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-26 09:47:45,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt.
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-26 09:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt
3: [2022-11-26 09:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-26 09:47:45,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
4: [2022-11-26 09:47:45,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
0: [2022-11-26 09:47:45,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 09:47:45,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt.
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
7: [2022-11-26 09:47:45,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt.
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-26 09:47:45,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
0: successfully saved checkpoint at iteration 20000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6119.21
7: iteration 20010/ 44073 | consumed samples: 10245120 | consumed tokens: 20982005760 | elapsed time per iteration (s): 4.94 | learning rate: 1.245E-04 | global batch size: 512 | lm loss: 2.061019E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.699 | TFLOPs: 48.33 |
7: iteration 20020/ 44073 | consumed samples: 10250240 | consumed tokens: 20992491520 | elapsed time per iteration (s): 4.30 | learning rate: 1.244E-04 | global batch size: 512 | lm loss: 2.049533E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.145 | TFLOPs: 55.53 |
7: iteration 20030/ 44073 | consumed samples: 10255360 | consumed tokens: 21002977280 | elapsed time per iteration (s): 4.18 | learning rate: 1.244E-04 | global batch size: 512 | lm loss: 2.050448E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.563 | TFLOPs: 57.12 |
7: iteration 20040/ 44073 | consumed samples: 10260480 | consumed tokens: 21013463040 | elapsed time per iteration (s): 4.20 | learning rate: 1.243E-04 | global batch size: 512 | lm loss: 2.025711E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.786 | TFLOPs: 56.76 |
7: iteration 20050/ 44073 | consumed samples: 10265600 | consumed tokens: 21023948800 | elapsed time per iteration (s): 4.24 | learning rate: 1.242E-04 | global batch size: 512 | lm loss: 2.328758E+00 | grad norm: 12.363 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.844 | TFLOPs: 56.32 |
7: iteration 20060/ 44073 | consumed samples: 10270720 | consumed tokens: 21034434560 | elapsed time per iteration (s): 4.19 | learning rate: 1.242E-04 | global batch size: 512 | lm loss: 2.778331E+00 | grad norm: 1.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.221 | TFLOPs: 56.96 |
7: iteration 20070/ 44073 | consumed samples: 10275840 | consumed tokens: 21044920320 | elapsed time per iteration (s): 4.15 | learning rate: 1.241E-04 | global batch size: 512 | lm loss: 2.201871E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.366 | TFLOPs: 57.49 |
7: iteration 20080/ 44073 | consumed samples: 10280960 | consumed tokens: 21055406080 | elapsed time per iteration (s): 9.32 | learning rate: 1.241E-04 | global batch size: 512 | lm loss: 2.090180E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 54.934 | TFLOPs: 25.60 |
7: iteration 20090/ 44073 | consumed samples: 10286080 | consumed tokens: 21065891840 | elapsed time per iteration (s): 4.19 | learning rate: 1.240E-04 | global batch size: 512 | lm loss: 2.078860E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.203 | TFLOPs: 56.95 |
7: iteration 20100/ 44073 | consumed samples: 10291200 | consumed tokens: 21076377600 | elapsed time per iteration (s): 4.26 | learning rate: 1.239E-04 | global batch size: 512 | lm loss: 2.068161E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.286 | TFLOPs: 56.06 |
7: iteration 20110/ 44073 | consumed samples: 10296320 | consumed tokens: 21086863360 | elapsed time per iteration (s): 4.14 | learning rate: 1.239E-04 | global batch size: 512 | lm loss: 2.063552E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.739 | TFLOPs: 57.67 |
7: iteration 20120/ 44073 | consumed samples: 10301440 | consumed tokens: 21097349120 | elapsed time per iteration (s): 4.16 | learning rate: 1.238E-04 | global batch size: 512 | lm loss: 2.045284E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.954 | TFLOPs: 57.30 |
7: iteration 20130/ 44073 | consumed samples: 10306560 | consumed tokens: 21107834880 | elapsed time per iteration (s): 4.18 | learning rate: 1.237E-04 | global batch size: 512 | lm loss: 2.052179E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.425 | TFLOPs: 57.06 |
7: iteration 20140/ 44073 | consumed samples: 10311680 | consumed tokens: 21118320640 | elapsed time per iteration (s): 4.22 | learning rate: 1.237E-04 | global batch size: 512 | lm loss: 2.041693E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.394 | TFLOPs: 56.58 |
7: iteration 20150/ 44073 | consumed samples: 10316800 | consumed tokens: 21128806400 | elapsed time per iteration (s): 4.18 | learning rate: 1.236E-04 | global batch size: 512 | lm loss: 2.046368E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.560 | TFLOPs: 57.12 |
7: iteration 20160/ 44073 | consumed samples: 10321920 | consumed tokens: 21139292160 | elapsed time per iteration (s): 4.19 | learning rate: 1.235E-04 | global batch size: 512 | lm loss: 2.054891E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.229 | TFLOPs: 56.96 |
7: iteration 20170/ 44073 | consumed samples: 10327040 | consumed tokens: 21149777920 | elapsed time per iteration (s): 4.22 | learning rate: 1.235E-04 | global batch size: 512 | lm loss: 2.048952E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.425 | TFLOPs: 56.59 |
7: iteration 20180/ 44073 | consumed samples: 10332160 | consumed tokens: 21160263680 | elapsed time per iteration (s): 4.19 | learning rate: 1.234E-04 | global batch size: 512 | lm loss: 2.040958E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.089 | TFLOPs: 56.90 |
7: iteration 20190/ 44073 | consumed samples: 10337280 | consumed tokens: 21170749440 | elapsed time per iteration (s): 4.21 | learning rate: 1.233E-04 | global batch size: 512 | lm loss: 2.037601E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.750 | TFLOPs: 56.74 |
7: iteration 20200/ 44073 | consumed samples: 10342400 | consumed tokens: 21181235200 | elapsed time per iteration (s): 4.20 | learning rate: 1.233E-04 | global batch size: 512 | lm loss: 2.045430E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.914 | TFLOPs: 56.82 |
7: iteration 20210/ 44073 | consumed samples: 10347520 | consumed tokens: 21191720960 | elapsed time per iteration (s): 4.18 | learning rate: 1.232E-04 | global batch size: 512 | lm loss: 2.049537E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.512 | TFLOPs: 57.10 |
7: iteration 20220/ 44073 | consumed samples: 10352640 | consumed tokens: 21202206720 | elapsed time per iteration (s): 4.16 | learning rate: 1.232E-04 | global batch size: 512 | lm loss: 2.045837E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 |
7: iteration 20230/ 44073 | consumed samples: 10357760 | consumed tokens: 21212692480 | elapsed time per iteration (s): 4.18 | learning rate: 1.231E-04 | global batch size: 512 | lm loss: 2.014589E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.434 | TFLOPs: 57.06 |
7: iteration 20240/ 44073 | consumed samples: 10362880 | consumed tokens: 21223178240 | elapsed time per iteration (s): 4.18 | learning rate: 1.230E-04 | global batch size: 512 | lm loss: 2.063896E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.416 | TFLOPs: 57.05 |
7: iteration 20250/ 44073 | consumed samples: 10368000 | consumed tokens: 21233664000 | elapsed time per iteration (s): 4.16 | learning rate: 1.230E-04 | global batch size: 512 | lm loss: 2.038480E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.998 | TFLOPs: 57.32 |
7: iteration 20260/ 44073 | consumed samples: 10373120 | consumed tokens: 21244149760 | elapsed time per iteration (s): 4.15 | learning rate: 1.229E-04 | global batch size: 512 | lm loss: 2.035294E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.416 | TFLOPs: 57.52 |
7: iteration 20270/ 44073 | consumed samples: 10378240 | consumed tokens: 21254635520 | elapsed time per iteration (s): 4.19 | learning rate: 1.228E-04 | global batch size: 512 | lm loss: 2.064917E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.144 | TFLOPs: 56.93 |
7: iteration 20280/ 44073 | consumed samples: 10383360 | consumed tokens: 21265121280 | elapsed time per iteration (s): 4.24 | learning rate: 1.228E-04 | global batch size: 512 | lm loss: 2.051265E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.889 | TFLOPs: 56.34 |
7: iteration 20290/ 44073 | consumed samples: 10388480 | consumed tokens: 21275607040 | elapsed time per iteration (s): 4.15 | learning rate: 1.227E-04 | global batch size: 512 | lm loss: 2.047082E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.376 | TFLOPs: 57.50 |
7: iteration 20300/ 44073 | consumed samples: 10393600 | consumed tokens: 21286092800 | elapsed time per iteration (s): 4.14 | learning rate: 1.226E-04 | global batch size: 512 | lm loss: 2.033408E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.553 | TFLOPs: 57.58 |
7: iteration 20310/ 44073 | consumed samples: 10398720 | consumed tokens: 21296578560 | elapsed time per iteration (s): 4.13 | learning rate: 1.226E-04 | global batch size: 512 | lm loss: 2.038226E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.870 | TFLOPs: 57.73 |
7: iteration 20320/ 44073 | consumed samples: 10403840 | consumed tokens: 21307064320 | elapsed time per iteration (s): 4.14 | learning rate: 1.225E-04 | global batch size: 512 | lm loss: 2.042274E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.617 | TFLOPs: 57.61 |
7: iteration 20330/ 44073 | consumed samples: 10408960 | consumed tokens: 21317550080 | elapsed time per iteration (s): 4.16 | learning rate: 1.224E-04 | global batch size: 512 | lm loss: 2.047858E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.200 | TFLOPs: 57.42 |
7: iteration 20340/ 44073 | consumed samples: 10414080 | consumed tokens: 21328035840 | elapsed time per iteration (s): 4.13 | learning rate: 1.224E-04 | global batch size: 512 | lm loss: 2.028837E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.859 | TFLOPs: 57.72 |
7: iteration 20350/ 44073 | consumed samples: 10419200 | consumed tokens: 21338521600 | elapsed time per iteration (s): 4.15 | learning rate: 1.223E-04 | global batch size: 512 | lm loss: 2.027843E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.484 | TFLOPs: 57.55 |
7: iteration 20360/ 44073 | consumed samples: 10424320 | consumed tokens: 21349007360 | elapsed time per iteration (s): 4.14 | learning rate: 1.223E-04 | global batch size: 512 | lm loss: 2.029157E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.726 | TFLOPs: 57.66 |
7: iteration 20370/ 44073 | consumed samples: 10429440 | consumed tokens: 21359493120 | elapsed time per iteration (s): 4.14 | learning rate: 1.222E-04 | global batch size: 512 | lm loss: 2.041462E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.665 | TFLOPs: 57.63 |
7: iteration 20380/ 44073 | consumed samples: 10434560 | consumed tokens: 21369978880 | elapsed time per iteration (s): 4.14 | learning rate: 1.221E-04 | global batch size: 512 | lm loss: 2.020076E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.692 | TFLOPs: 57.65 |
7: iteration 20390/ 44073 | consumed samples: 10439680 | consumed tokens: 21380464640 | elapsed time per iteration (s): 4.14 | learning rate: 1.221E-04 | global batch size: 512 | lm loss: 2.016118E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.727 | TFLOPs: 57.66 |
7: iteration 20400/ 44073 | consumed samples: 10444800 | consumed tokens: 21390950400 | elapsed time per iteration (s): 4.16 | learning rate: 1.220E-04 | global batch size: 512 | lm loss: 2.047171E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 123.102 | TFLOPs: 57.37 | 7: iteration 20410/ 44073 | consumed samples: 10449920 | consumed tokens: 21401436160 | elapsed time per iteration (s): 4.16 | learning rate: 1.219E-04 | global batch size: 512 | lm loss: 2.025972E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.988 | TFLOPs: 57.32 | 7: iteration 20420/ 44073 | consumed samples: 10455040 | consumed tokens: 21411921920 | elapsed time per iteration (s): 4.14 | learning rate: 1.219E-04 | global batch size: 512 | lm loss: 2.041705E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.694 | TFLOPs: 57.65 | 7: iteration 20430/ 44073 | consumed samples: 10460160 | consumed tokens: 21422407680 | elapsed time per iteration (s): 4.19 | learning rate: 1.218E-04 | global batch size: 512 | lm loss: 2.030466E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.144 | TFLOPs: 56.93 | 7: iteration 20440/ 44073 | consumed samples: 10465280 | consumed tokens: 21432893440 | elapsed time per iteration (s): 4.13 | learning rate: 1.217E-04 | global batch size: 512 | lm loss: 2.042419E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.833 | TFLOPs: 57.71 | 7: iteration 20450/ 44073 | consumed samples: 10470400 | consumed tokens: 21443379200 | elapsed time per iteration (s): 4.15 | learning rate: 1.217E-04 | global batch size: 512 | lm loss: 2.028586E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.410 | TFLOPs: 57.52 | 7: iteration 20460/ 44073 | consumed samples: 10475520 | consumed tokens: 21453864960 | elapsed time per iteration (s): 4.17 | learning rate: 1.216E-04 | global 
batch size: 512 | lm loss: 2.038422E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.881 | TFLOPs: 57.27 | 7: iteration 20470/ 44073 | consumed samples: 10480640 | consumed tokens: 21464350720 | elapsed time per iteration (s): 4.16 | learning rate: 1.215E-04 | global batch size: 512 | lm loss: 2.033038E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.123 | TFLOPs: 57.38 | 7: iteration 20480/ 44073 | consumed samples: 10485760 | consumed tokens: 21474836480 | elapsed time per iteration (s): 4.19 | learning rate: 1.215E-04 | global batch size: 512 | lm loss: 2.053348E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.299 | TFLOPs: 57.00 | 7: iteration 20490/ 44073 | consumed samples: 10490880 | consumed tokens: 21485322240 | elapsed time per iteration (s): 4.62 | learning rate: 1.214E-04 | global batch size: 512 | lm loss: 2.015442E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 110.767 | TFLOPs: 51.62 | 7: iteration 20500/ 44073 | consumed samples: 10496000 | consumed tokens: 21495808000 | elapsed time per iteration (s): 4.17 | learning rate: 1.214E-04 | global batch size: 512 | lm loss: 2.036973E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.849 | TFLOPs: 57.25 | 7: iteration 20510/ 44073 | consumed samples: 10501120 | consumed tokens: 21506293760 | elapsed time per iteration (s): 4.16 | learning rate: 1.213E-04 | global batch size: 512 | lm loss: 2.045824E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.003 | TFLOPs: 57.33 | 7: iteration 20520/ 44073 | consumed samples: 10506240 | consumed 
tokens: 21516779520 | elapsed time per iteration (s): 4.16 | learning rate: 1.212E-04 | global batch size: 512 | lm loss: 2.044776E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.060 | TFLOPs: 57.35 | 7: iteration 20530/ 44073 | consumed samples: 10511360 | consumed tokens: 21527265280 | elapsed time per iteration (s): 4.17 | learning rate: 1.212E-04 | global batch size: 512 | lm loss: 2.025224E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.813 | TFLOPs: 57.24 | 7: iteration 20540/ 44073 | consumed samples: 10516480 | consumed tokens: 21537751040 | elapsed time per iteration (s): 4.17 | learning rate: 1.211E-04 | global batch size: 512 | lm loss: 2.048129E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.822 | TFLOPs: 57.24 | 7: iteration 20550/ 44073 | consumed samples: 10521600 | consumed tokens: 21548236800 | elapsed time per iteration (s): 4.17 | learning rate: 1.210E-04 | global batch size: 512 | lm loss: 2.011836E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.769 | TFLOPs: 57.22 | 7: iteration 20560/ 44073 | consumed samples: 10526720 | consumed tokens: 21558722560 | elapsed time per iteration (s): 4.16 | learning rate: 1.210E-04 | global batch size: 512 | lm loss: 2.032344E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.165 | TFLOPs: 57.40 | 7: iteration 20570/ 44073 | consumed samples: 10531840 | consumed tokens: 21569208320 | elapsed time per iteration (s): 4.15 | learning rate: 1.209E-04 | global batch size: 512 | lm loss: 2.044308E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.435 | TFLOPs: 57.53 | 7: iteration 20580/ 44073 | consumed samples: 10536960 | consumed tokens: 21579694080 | elapsed time per iteration (s): 4.16 | learning rate: 1.208E-04 | global batch size: 512 | lm loss: 2.050671E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.934 | TFLOPs: 57.29 | 7: iteration 20590/ 44073 | consumed samples: 10542080 | consumed tokens: 21590179840 | elapsed time per iteration (s): 4.16 | learning rate: 1.208E-04 | global batch size: 512 | lm loss: 2.003022E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.047 | TFLOPs: 57.35 | 7: iteration 20600/ 44073 | consumed samples: 10547200 | consumed tokens: 21600665600 | elapsed time per iteration (s): 4.17 | learning rate: 1.207E-04 | global batch size: 512 | lm loss: 2.020011E+00 | grad norm: 0.113 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.831 | TFLOPs: 57.25 | 7: iteration 20610/ 44073 | consumed samples: 10552320 | consumed tokens: 21611151360 | elapsed time per iteration (s): 4.17 | learning rate: 1.206E-04 | global batch size: 512 | lm loss: 2.037202E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.746 | TFLOPs: 57.21 | 7: iteration 20620/ 44073 | consumed samples: 10557440 | consumed tokens: 21621637120 | elapsed time per iteration (s): 4.18 | learning rate: 1.206E-04 | global batch size: 512 | lm loss: 2.031546E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.377 | TFLOPs: 57.03 | 7: iteration 20630/ 44073 | consumed samples: 10562560 | consumed tokens: 21632122880 | elapsed time per iteration (s): 4.14 | learning rate: 1.205E-04 | global batch size: 512 | lm loss: 2.024023E+00 | grad norm: 0.118 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.598 | TFLOPs: 57.60 | 7: iteration 20640/ 44073 | consumed samples: 10567680 | consumed tokens: 21642608640 | elapsed time per iteration (s): 4.14 | learning rate: 1.205E-04 | global batch size: 512 | lm loss: 2.038143E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.748 | TFLOPs: 57.67 | 7: iteration 20650/ 44073 | consumed samples: 10572800 | consumed tokens: 21653094400 | elapsed time per iteration (s): 4.15 | learning rate: 1.204E-04 | global batch size: 512 | lm loss: 2.026835E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.286 | TFLOPs: 57.46 | 7: iteration 20660/ 44073 | consumed samples: 10577920 | consumed tokens: 21663580160 | elapsed time per iteration (s): 4.15 | learning rate: 1.203E-04 | global batch size: 512 | lm loss: 2.044021E+00 | grad norm: 0.113 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.280 | TFLOPs: 57.45 | 7: iteration 20670/ 44073 | consumed samples: 10583040 | consumed tokens: 21674065920 | elapsed time per iteration (s): 4.14 | learning rate: 1.203E-04 | global batch size: 512 | lm loss: 2.037287E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.667 | TFLOPs: 57.64 | 7: iteration 20680/ 44073 | consumed samples: 10588160 | consumed tokens: 21684551680 | elapsed time per iteration (s): 4.15 | learning rate: 1.202E-04 | global batch size: 512 | lm loss: 2.033514E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.316 | TFLOPs: 57.47 | 7: iteration 20690/ 44073 | consumed samples: 10593280 | consumed tokens: 21695037440 | elapsed time per iteration (s): 4.17 
| learning rate: 1.201E-04 | global batch size: 512 | lm loss: 2.023586E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.839 | TFLOPs: 57.25 | 7: iteration 20700/ 44073 | consumed samples: 10598400 | consumed tokens: 21705523200 | elapsed time per iteration (s): 4.14 | learning rate: 1.201E-04 | global batch size: 512 | lm loss: 2.018759E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.533 | TFLOPs: 57.57 | 7: iteration 20710/ 44073 | consumed samples: 10603520 | consumed tokens: 21716008960 | elapsed time per iteration (s): 4.15 | learning rate: 1.200E-04 | global batch size: 512 | lm loss: 2.017188E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.284 | TFLOPs: 57.46 | 7: iteration 20720/ 44073 | consumed samples: 10608640 | consumed tokens: 21726494720 | elapsed time per iteration (s): 4.18 | learning rate: 1.199E-04 | global batch size: 512 | lm loss: 2.019132E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.511 | TFLOPs: 57.10 | 7: iteration 20730/ 44073 | consumed samples: 10613760 | consumed tokens: 21736980480 | elapsed time per iteration (s): 4.22 | learning rate: 1.199E-04 | global batch size: 512 | lm loss: 2.019461E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.348 | TFLOPs: 56.55 | 7: iteration 20740/ 44073 | consumed samples: 10618880 | consumed tokens: 21747466240 | elapsed time per iteration (s): 4.16 | learning rate: 1.198E-04 | global batch size: 512 | lm loss: 2.023248E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.035 | TFLOPs: 57.34 | 7: iteration 20750/ 44073 | 
consumed samples: 10624000 | consumed tokens: 21757952000 | elapsed time per iteration (s): 4.15 | learning rate: 1.197E-04 | global batch size: 512 | lm loss: 2.048646E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.322 | TFLOPs: 57.47 | 7: iteration 20760/ 44073 | consumed samples: 10629120 | consumed tokens: 21768437760 | elapsed time per iteration (s): 4.18 | learning rate: 1.197E-04 | global batch size: 512 | lm loss: 2.050701E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.433 | TFLOPs: 57.06 | 7: iteration 20770/ 44073 | consumed samples: 10634240 | consumed tokens: 21778923520 | elapsed time per iteration (s): 4.18 | learning rate: 1.196E-04 | global batch size: 512 | lm loss: 2.053798E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.547 | TFLOPs: 57.11 | 7: iteration 20780/ 44073 | consumed samples: 10639360 | consumed tokens: 21789409280 | elapsed time per iteration (s): 4.19 | learning rate: 1.196E-04 | global batch size: 512 | lm loss: 2.033288E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.158 | TFLOPs: 56.93 | 7: iteration 20790/ 44073 | consumed samples: 10644480 | consumed tokens: 21799895040 | elapsed time per iteration (s): 4.16 | learning rate: 1.195E-04 | global batch size: 512 | lm loss: 2.011921E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.954 | TFLOPs: 57.30 | 7: iteration 20800/ 44073 | consumed samples: 10649600 | consumed tokens: 21810380800 | elapsed time per iteration (s): 4.20 | learning rate: 1.194E-04 | global batch size: 512 | lm loss: 2.035860E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 122.004 | TFLOPs: 56.86 | 7: iteration 20810/ 44073 | consumed samples: 10654720 | consumed tokens: 21820866560 | elapsed time per iteration (s): 4.17 | learning rate: 1.194E-04 | global batch size: 512 | lm loss: 2.025258E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.667 | TFLOPs: 57.17 | 7: iteration 20820/ 44073 | consumed samples: 10659840 | consumed tokens: 21831352320 | elapsed time per iteration (s): 4.17 | learning rate: 1.193E-04 | global batch size: 512 | lm loss: 2.029506E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.824 | TFLOPs: 57.24 | 7: iteration 20830/ 44073 | consumed samples: 10664960 | consumed tokens: 21841838080 | elapsed time per iteration (s): 4.17 | learning rate: 1.192E-04 | global batch size: 512 | lm loss: 2.046487E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.890 | TFLOPs: 57.27 | 7: iteration 20840/ 44073 | consumed samples: 10670080 | consumed tokens: 21852323840 | elapsed time per iteration (s): 4.26 | learning rate: 1.192E-04 | global batch size: 512 | lm loss: 2.041764E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.179 | TFLOPs: 56.01 | 7: iteration 20850/ 44073 | consumed samples: 10675200 | consumed tokens: 21862809600 | elapsed time per iteration (s): 4.15 | learning rate: 1.191E-04 | global batch size: 512 | lm loss: 2.044148E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.428 | TFLOPs: 57.52 | 7: iteration 20860/ 44073 | consumed samples: 10680320 | consumed tokens: 21873295360 | elapsed time per iteration (s): 4.42 | learning rate: 1.190E-04 | global batch size: 512 | lm loss: 
2.029172E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 115.853 | TFLOPs: 53.99 | 7: iteration 20870/ 44073 | consumed samples: 10685440 | consumed tokens: 21883781120 | elapsed time per iteration (s): 4.15 | learning rate: 1.190E-04 | global batch size: 512 | lm loss: 2.040945E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.410 | TFLOPs: 57.52 | 7: iteration 20880/ 44073 | consumed samples: 10690560 | consumed tokens: 21894266880 | elapsed time per iteration (s): 4.28 | learning rate: 1.189E-04 | global batch size: 512 | lm loss: 2.039779E+00 | grad norm: 0.113 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.759 | TFLOPs: 55.81 | 7: iteration 20890/ 44073 | consumed samples: 10695680 | consumed tokens: 21904752640 | elapsed time per iteration (s): 4.18 | learning rate: 1.188E-04 | global batch size: 512 | lm loss: 2.043612E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.388 | TFLOPs: 57.04 | 7: iteration 20900/ 44073 | consumed samples: 10700800 | consumed tokens: 21915238400 | elapsed time per iteration (s): 4.17 | learning rate: 1.188E-04 | global batch size: 512 | lm loss: 2.015572E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.857 | TFLOPs: 57.26 | 7: iteration 20910/ 44073 | consumed samples: 10705920 | consumed tokens: 21925724160 | elapsed time per iteration (s): 4.18 | learning rate: 1.187E-04 | global batch size: 512 | lm loss: 2.029626E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.582 | TFLOPs: 57.13 | 7: iteration 20920/ 44073 | consumed samples: 10711040 | consumed tokens: 21936209920 | 
elapsed time per iteration (s): 4.15 | learning rate: 1.187E-04 | global batch size: 512 | lm loss: 2.022239E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 20930/ 44073 | consumed samples: 10716160 | consumed tokens: 21946695680 | elapsed time per iteration (s): 4.17 | learning rate: 1.186E-04 | global batch size: 512 | lm loss: 2.031092E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.635 | TFLOPs: 57.15 | 7: iteration 20940/ 44073 | consumed samples: 10721280 | consumed tokens: 21957181440 | elapsed time per iteration (s): 4.46 | learning rate: 1.185E-04 | global batch size: 512 | lm loss: 2.032723E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.901 | TFLOPs: 53.55 | 7: iteration 20950/ 44073 | consumed samples: 10726400 | consumed tokens: 21967667200 | elapsed time per iteration (s): 4.18 | learning rate: 1.185E-04 | global batch size: 512 | lm loss: 2.045791E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.602 | TFLOPs: 57.14 | 7: iteration 20960/ 44073 | consumed samples: 10731520 | consumed tokens: 21978152960 | elapsed time per iteration (s): 4.28 | learning rate: 1.184E-04 | global batch size: 512 | lm loss: 2.035476E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.574 | TFLOPs: 55.73 | 7: iteration 20970/ 44073 | consumed samples: 10736640 | consumed tokens: 21988638720 | elapsed time per iteration (s): 4.30 | learning rate: 1.183E-04 | global batch size: 512 | lm loss: 2.019584E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.967 | TFLOPs: 
55.44 | 7: iteration 20980/ 44073 | consumed samples: 10741760 | consumed tokens: 21999124480 | elapsed time per iteration (s): 4.16 | learning rate: 1.183E-04 | global batch size: 512 | lm loss: 2.046509E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.976 | TFLOPs: 57.31 | 7: iteration 20990/ 44073 | consumed samples: 10746880 | consumed tokens: 22009610240 | elapsed time per iteration (s): 4.19 | learning rate: 1.182E-04 | global batch size: 512 | lm loss: 2.024197E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.314 | TFLOPs: 57.00 | 7: iteration 21000/ 44073 | consumed samples: 10752000 | consumed tokens: 22020096000 | elapsed time per iteration (s): 4.18 | learning rate: 1.181E-04 | global batch size: 512 | lm loss: 2.023274E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.557 | TFLOPs: 57.12 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 21000 | lm loss value: 1.973224E+00 | lm loss PPL: 7.193833E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 21000 to checkpoints_2b2 0: [2022-11-26 10:58:23,226] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step21000 is begin to save! 0: [2022-11-26 10:58:23,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:58:23,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:58:23,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 10:58:23,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:58:23,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:58:23,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:58:23,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:58:23,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:58:23,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:58:24,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:58:24,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:58:24,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:58:24,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:58:24,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:58:24,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 10:58:24,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:58:24,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:58:24,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:58:24,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:58:24,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:58:24,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:58:24,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:58:24,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:58:25,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:58:25,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:58:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:58:25,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 10:58:25,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:58:25,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:58:25,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:58:25,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:58:25,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:58:25,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:58:25,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:58:25,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:58:25,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:58:25,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:58:25,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:58:25,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 10:58:25,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:58:25,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:58:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:58:26,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:58:26,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:58:26,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:58:26,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:58:26,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:58:26,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:58:26,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:58:26,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:58:26,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 10:58:26,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:58:26,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:58:26,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:58:26,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_29-model_00-model_states.pt... 0: [2022-11-26 10:58:26,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_29-model_00-model_states.pt. 0: [2022-11-26 10:58:26,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:58:27,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:58:27,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_31-model_00-model_states.pt... 0: [2022-11-26 10:58:27,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_31-model_00-model_states.pt. 0: [2022-11-26 10:58:27,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_32-model_00-model_states.pt... 0: [2022-11-26 10:58:27,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_32-model_00-model_states.pt. 0: [2022-11-26 10:58:27,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 10:58:27,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_33-model_00-model_states.pt. 0: [2022-11-26 10:58:27,462] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_34-model_00-model_states.pt... 0: [2022-11-26 10:58:27,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_34-model_00-model_states.pt. 0: [2022-11-26 10:58:27,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/layer_36-model_00-model_states.pt... 0: [2022-11-26 10:58:27,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/layer_36-model_00-model_states.pt. 0: [2022-11-26 10:58:27,590] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step21000/mp_rank_00_model_states.pt 0: [2022-11-26 10:58:27,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:58:27,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:58:27,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:58:28,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
0: [2022-11-26 10:58:28,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
4: [2022-11-26 10:58:28,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:58:28,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:58:28,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt
7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 10:58:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:58:28,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 10:58:28,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:58:28,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:58:28,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:58:28,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:58:28,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 10:58:28,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:58:28,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:58:28,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 10:58:28,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 10:58:28,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:58:28,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:58:28,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:58:28,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:58:28,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:58:28,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 10:58:28,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 10:58:28,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:58:28,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:58:28,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
0: successfully saved checkpoint at iteration 21000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5695.34 7: iteration 21010/ 44073 | consumed samples: 10757120 | consumed tokens: 22030581760 | elapsed time per iteration (s): 4.85 | learning rate: 1.181E-04 | global batch size: 512 | lm loss: 2.036519E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.650 | TFLOPs: 49.24 | 7: iteration 21020/ 44073 | consumed samples: 10762240 | consumed tokens: 22041067520 | elapsed time per iteration (s): 4.16 | learning rate: 1.180E-04 | global batch size: 512 | lm loss: 2.035259E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.028 | TFLOPs: 57.34 | 7: iteration 21030/ 44073 | consumed samples: 10767360 | consumed tokens: 22051553280 | elapsed time per iteration (s): 4.20 | learning rate: 1.179E-04 | global batch size: 512 | lm loss: 2.036703E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.938 | TFLOPs: 56.83 | 7: iteration 21040/ 44073 | consumed samples: 10772480 | consumed tokens: 22062039040 | elapsed time per iteration (s): 4.17 | learning rate: 1.179E-04 | global batch size: 512 | lm loss: 2.034653E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.803 | TFLOPs: 57.23 | 7: iteration 21050/ 44073 | consumed samples: 10777600 | consumed tokens: 22072524800 | elapsed time per iteration (s): 4.17 | learning rate: 1.178E-04 | global batch size: 512 | lm loss: 2.037935E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.740 | TFLOPs: 57.20 | 7: iteration 21060/ 44073 | consumed samples: 10782720 | consumed tokens: 22083010560 | elapsed time per iteration (s): 4.14 | learning rate: 
1.177E-04 | global batch size: 512 | lm loss: 2.034840E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.596 | TFLOPs: 57.60 | 7: iteration 21070/ 44073 | consumed samples: 10787840 | consumed tokens: 22093496320 | elapsed time per iteration (s): 4.16 | learning rate: 1.177E-04 | global batch size: 512 | lm loss: 2.034842E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.080 | TFLOPs: 57.36 | 7: iteration 21080/ 44073 | consumed samples: 10792960 | consumed tokens: 22103982080 | elapsed time per iteration (s): 4.17 | learning rate: 1.176E-04 | global batch size: 512 | lm loss: 2.038183E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.711 | TFLOPs: 57.19 | 7: iteration 21090/ 44073 | consumed samples: 10798080 | consumed tokens: 22114467840 | elapsed time per iteration (s): 4.17 | learning rate: 1.176E-04 | global batch size: 512 | lm loss: 2.024009E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.647 | TFLOPs: 57.16 | 7: iteration 21100/ 44073 | consumed samples: 10803200 | consumed tokens: 22124953600 | elapsed time per iteration (s): 4.19 | learning rate: 1.175E-04 | global batch size: 512 | lm loss: 2.044567E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.129 | TFLOPs: 56.92 | 7: iteration 21110/ 44073 | consumed samples: 10808320 | consumed tokens: 22135439360 | elapsed time per iteration (s): 4.15 | learning rate: 1.174E-04 | global batch size: 512 | lm loss: 2.011850E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.482 | TFLOPs: 57.55 | 7: iteration 21120/ 44073 | consumed samples: 
10813440 | consumed tokens: 22145925120 | elapsed time per iteration (s): 4.15 | learning rate: 1.174E-04 | global batch size: 512 | lm loss: 2.013822E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.279 | TFLOPs: 57.45 | 7: iteration 21130/ 44073 | consumed samples: 10818560 | consumed tokens: 22156410880 | elapsed time per iteration (s): 4.14 | learning rate: 1.173E-04 | global batch size: 512 | lm loss: 2.040521E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.706 | TFLOPs: 57.65 | 7: iteration 21140/ 44073 | consumed samples: 10823680 | consumed tokens: 22166896640 | elapsed time per iteration (s): 4.18 | learning rate: 1.172E-04 | global batch size: 512 | lm loss: 2.027654E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.563 | TFLOPs: 57.12 | 7: iteration 21150/ 44073 | consumed samples: 10828800 | consumed tokens: 22177382400 | elapsed time per iteration (s): 4.17 | learning rate: 1.172E-04 | global batch size: 512 | lm loss: 2.021224E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.904 | TFLOPs: 57.28 | 7: iteration 21160/ 44073 | consumed samples: 10833920 | consumed tokens: 22187868160 | elapsed time per iteration (s): 4.48 | learning rate: 1.171E-04 | global batch size: 512 | lm loss: 2.004399E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.164 | TFLOPs: 53.21 | 7: iteration 21170/ 44073 | consumed samples: 10839040 | consumed tokens: 22198353920 | elapsed time per iteration (s): 4.16 | learning rate: 1.170E-04 | global batch size: 512 | lm loss: 2.029424E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.119 | TFLOPs: 57.38 | 7: iteration 21180/ 44073 | consumed samples: 10844160 | consumed tokens: 22208839680 | elapsed time per iteration (s): 4.17 | learning rate: 1.170E-04 | global batch size: 512 | lm loss: 2.028957E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.817 | TFLOPs: 57.24 | 7: iteration 21190/ 44073 | consumed samples: 10849280 | consumed tokens: 22219325440 | elapsed time per iteration (s): 4.14 | learning rate: 1.169E-04 | global batch size: 512 | lm loss: 2.025803E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.734 | TFLOPs: 57.67 | 7: iteration 21200/ 44073 | consumed samples: 10854400 | consumed tokens: 22229811200 | elapsed time per iteration (s): 4.17 | learning rate: 1.168E-04 | global batch size: 512 | lm loss: 2.047637E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.801 | TFLOPs: 57.23 | 7: iteration 21210/ 44073 | consumed samples: 10859520 | consumed tokens: 22240296960 | elapsed time per iteration (s): 4.16 | learning rate: 1.168E-04 | global batch size: 512 | lm loss: 1.997017E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.068 | TFLOPs: 57.36 | 7: iteration 21220/ 44073 | consumed samples: 10864640 | consumed tokens: 22250782720 | elapsed time per iteration (s): 4.19 | learning rate: 1.167E-04 | global batch size: 512 | lm loss: 2.017566E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.060 | TFLOPs: 56.89 | 7: iteration 21230/ 44073 | consumed samples: 10869760 | consumed tokens: 22261268480 | elapsed time per iteration (s): 4.20 | learning rate: 1.167E-04 | global batch size: 512 | lm loss: 2.033591E+00 | 
grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.019 | TFLOPs: 56.87 | 7: iteration 21240/ 44073 | consumed samples: 10874880 | consumed tokens: 22271754240 | elapsed time per iteration (s): 4.16 | learning rate: 1.166E-04 | global batch size: 512 | lm loss: 2.012628E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.222 | TFLOPs: 57.43 | 7: iteration 21250/ 44073 | consumed samples: 10880000 | consumed tokens: 22282240000 | elapsed time per iteration (s): 4.46 | learning rate: 1.165E-04 | global batch size: 512 | lm loss: 2.045450E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.690 | TFLOPs: 53.45 | 7: iteration 21260/ 44073 | consumed samples: 10885120 | consumed tokens: 22292725760 | elapsed time per iteration (s): 4.16 | learning rate: 1.165E-04 | global batch size: 512 | lm loss: 2.022631E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.149 | TFLOPs: 57.39 | 7: iteration 21270/ 44073 | consumed samples: 10890240 | consumed tokens: 22303211520 | elapsed time per iteration (s): 4.15 | learning rate: 1.164E-04 | global batch size: 512 | lm loss: 2.014307E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.472 | TFLOPs: 57.54 | 7: iteration 21280/ 44073 | consumed samples: 10895360 | consumed tokens: 22313697280 | elapsed time per iteration (s): 4.15 | learning rate: 1.163E-04 | global batch size: 512 | lm loss: 2.025614E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.294 | TFLOPs: 57.46 | 7: iteration 21290/ 44073 | consumed samples: 10900480 | consumed tokens: 22324183040 | elapsed time per 
iteration (s): 4.17 | learning rate: 1.163E-04 | global batch size: 512 | lm loss: 2.025008E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.743 | TFLOPs: 57.20 | 7: iteration 21300/ 44073 | consumed samples: 10905600 | consumed tokens: 22334668800 | elapsed time per iteration (s): 4.19 | learning rate: 1.162E-04 | global batch size: 512 | lm loss: 2.037914E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.108 | TFLOPs: 56.91 | 7: iteration 21310/ 44073 | consumed samples: 10910720 | consumed tokens: 22345154560 | elapsed time per iteration (s): 4.20 | learning rate: 1.161E-04 | global batch size: 512 | lm loss: 2.008415E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.931 | TFLOPs: 56.83 | 7: iteration 21320/ 44073 | consumed samples: 10915840 | consumed tokens: 22355640320 | elapsed time per iteration (s): 4.14 | learning rate: 1.161E-04 | global batch size: 512 | lm loss: 2.013055E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.709 | TFLOPs: 57.65 | 7: iteration 21330/ 44073 | consumed samples: 10920960 | consumed tokens: 22366126080 | elapsed time per iteration (s): 4.19 | learning rate: 1.160E-04 | global batch size: 512 | lm loss: 2.009224E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.189 | TFLOPs: 56.95 | 7: iteration 21340/ 44073 | consumed samples: 10926080 | consumed tokens: 22376611840 | elapsed time per iteration (s): 4.20 | learning rate: 1.159E-04 | global batch size: 512 | lm loss: 2.028854E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.960 | TFLOPs: 56.84 | 7: 
iteration 21350/ 44073 | consumed samples: 10931200 | consumed tokens: 22387097600 | elapsed time per iteration (s): 4.16 | learning rate: 1.159E-04 | global batch size: 512 | lm loss: 2.020617E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.042 | TFLOPs: 57.34 | 7: iteration 21360/ 44073 | consumed samples: 10936320 | consumed tokens: 22397583360 | elapsed time per iteration (s): 4.23 | learning rate: 1.158E-04 | global batch size: 512 | lm loss: 2.031797E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.937 | TFLOPs: 56.36 | 7: iteration 21370/ 44073 | consumed samples: 10941440 | consumed tokens: 22408069120 | elapsed time per iteration (s): 4.15 | learning rate: 1.157E-04 | global batch size: 512 | lm loss: 2.003519E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 21380/ 44073 | consumed samples: 10946560 | consumed tokens: 22418554880 | elapsed time per iteration (s): 4.16 | learning rate: 1.157E-04 | global batch size: 512 | lm loss: 1.997846E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.052 | TFLOPs: 57.35 | 7: iteration 21390/ 44073 | consumed samples: 10951680 | consumed tokens: 22429040640 | elapsed time per iteration (s): 4.16 | learning rate: 1.156E-04 | global batch size: 512 | lm loss: 2.020991E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.970 | TFLOPs: 57.31 | 7: iteration 21400/ 44073 | consumed samples: 10956800 | consumed tokens: 22439526400 | elapsed time per iteration (s): 4.15 | learning rate: 1.156E-04 | global batch size: 512 | lm loss: 2.022766E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.282 | TFLOPs: 57.46 | 7: iteration 21410/ 44073 | consumed samples: 10961920 | consumed tokens: 22450012160 | elapsed time per iteration (s): 4.16 | learning rate: 1.155E-04 | global batch size: 512 | lm loss: 2.001005E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.988 | TFLOPs: 57.32 | 7: iteration 21420/ 44073 | consumed samples: 10967040 | consumed tokens: 22460497920 | elapsed time per iteration (s): 4.17 | learning rate: 1.154E-04 | global batch size: 512 | lm loss: 2.033981E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.925 | TFLOPs: 57.29 | 7: iteration 21430/ 44073 | consumed samples: 10972160 | consumed tokens: 22470983680 | elapsed time per iteration (s): 4.21 | learning rate: 1.154E-04 | global batch size: 512 | lm loss: 2.016283E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.586 | TFLOPs: 56.67 | 7: iteration 21440/ 44073 | consumed samples: 10977280 | consumed tokens: 22481469440 | elapsed time per iteration (s): 4.14 | learning rate: 1.153E-04 | global batch size: 512 | lm loss: 2.001528E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.621 | TFLOPs: 57.61 | 7: iteration 21450/ 44073 | consumed samples: 10982400 | consumed tokens: 22491955200 | elapsed time per iteration (s): 4.16 | learning rate: 1.152E-04 | global batch size: 512 | lm loss: 2.030884E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.133 | TFLOPs: 57.39 | 7: iteration 21460/ 44073 | consumed samples: 10987520 | consumed tokens: 22502440960 | elapsed time per iteration (s): 4.16 | learning rate: 1.152E-04 | global 
batch size: 512 | lm loss: 2.025230E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.972 | TFLOPs: 57.31 | 7: iteration 21470/ 44073 | consumed samples: 10992640 | consumed tokens: 22512926720 | elapsed time per iteration (s): 4.15 | learning rate: 1.151E-04 | global batch size: 512 | lm loss: 2.040185E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.311 | TFLOPs: 57.47 | 7: iteration 21480/ 44073 | consumed samples: 10997760 | consumed tokens: 22523412480 | elapsed time per iteration (s): 4.17 | learning rate: 1.150E-04 | global batch size: 512 | lm loss: 2.018182E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.766 | TFLOPs: 57.21 | 7: iteration 21490/ 44073 | consumed samples: 11002880 | consumed tokens: 22533898240 | elapsed time per iteration (s): 4.21 | learning rate: 1.150E-04 | global batch size: 512 | lm loss: 2.014196E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.607 | TFLOPs: 56.68 | 7: iteration 21500/ 44073 | consumed samples: 11008000 | consumed tokens: 22544384000 | elapsed time per iteration (s): 4.18 | learning rate: 1.149E-04 | global batch size: 512 | lm loss: 2.024420E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.425 | TFLOPs: 57.06 | 7: iteration 21510/ 44073 | consumed samples: 11013120 | consumed tokens: 22554869760 | elapsed time per iteration (s): 4.16 | learning rate: 1.148E-04 | global batch size: 512 | lm loss: 2.041768E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.040 | TFLOPs: 57.34 | 7: iteration 21520/ 44073 | consumed samples: 11018240 | consumed 
tokens: 22565355520 | elapsed time per iteration (s): 4.14 | learning rate: 1.148E-04 | global batch size: 512 | lm loss: 2.016430E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.667 | TFLOPs: 57.64 | 7: iteration 21530/ 44073 | consumed samples: 11023360 | consumed tokens: 22575841280 | elapsed time per iteration (s): 4.17 | learning rate: 1.147E-04 | global batch size: 512 | lm loss: 2.026495E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.666 | TFLOPs: 57.17 | 7: iteration 21540/ 44073 | consumed samples: 11028480 | consumed tokens: 22586327040 | elapsed time per iteration (s): 4.15 | learning rate: 1.146E-04 | global batch size: 512 | lm loss: 2.027954E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.503 | TFLOPs: 57.56 | 7: iteration 21550/ 44073 | consumed samples: 11033600 | consumed tokens: 22596812800 | elapsed time per iteration (s): 4.16 | learning rate: 1.146E-04 | global batch size: 512 | lm loss: 2.033090E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.104 | TFLOPs: 57.37 | 7: iteration 21560/ 44073 | consumed samples: 11038720 | consumed tokens: 22607298560 | elapsed time per iteration (s): 4.15 | learning rate: 1.145E-04 | global batch size: 512 | lm loss: 2.026342E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.430 | TFLOPs: 57.52 | 7: iteration 21570/ 44073 | consumed samples: 11043840 | consumed tokens: 22617784320 | elapsed time per iteration (s): 4.15 | learning rate: 1.145E-04 | global batch size: 512 | lm loss: 2.008097E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.413 | TFLOPs: 57.52 | 7: iteration 21580/ 44073 | consumed samples: 11048960 | consumed tokens: 22628270080 | elapsed time per iteration (s): 4.14 | learning rate: 1.144E-04 | global batch size: 512 | lm loss: 2.026450E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.673 | TFLOPs: 57.64 | 7: iteration 21590/ 44073 | consumed samples: 11054080 | consumed tokens: 22638755840 | elapsed time per iteration (s): 4.14 | learning rate: 1.143E-04 | global batch size: 512 | lm loss: 2.014687E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.752 | TFLOPs: 57.67 | 7: iteration 21600/ 44073 | consumed samples: 11059200 | consumed tokens: 22649241600 | elapsed time per iteration (s): 4.13 | learning rate: 1.143E-04 | global batch size: 512 | lm loss: 2.014686E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.903 | TFLOPs: 57.75 | 7: iteration 21610/ 44073 | consumed samples: 11064320 | consumed tokens: 22659727360 | elapsed time per iteration (s): 40.74 | learning rate: 1.142E-04 | global batch size: 512 | lm loss: 2.037979E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 12.566 | TFLOPs: 5.86 | 7: iteration 21620/ 44073 | consumed samples: 11069440 | consumed tokens: 22670213120 | elapsed time per iteration (s): 4.16 | learning rate: 1.141E-04 | global batch size: 512 | lm loss: 2.034486E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.092 | TFLOPs: 57.37 | 7: iteration 21630/ 44073 | consumed samples: 11074560 | consumed tokens: 22680698880 | elapsed time per iteration (s): 15.84 | learning rate: 1.141E-04 | global batch size: 512 | lm loss: 2.021700E+00 | grad norm: 0.122 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 32.327 | TFLOPs: 15.07 | 7: iteration 21640/ 44073 | consumed samples: 11079680 | consumed tokens: 22691184640 | elapsed time per iteration (s): 4.15 | learning rate: 1.140E-04 | global batch size: 512 | lm loss: 2.019646E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.404 | TFLOPs: 57.51 | 7: iteration 21650/ 44073 | consumed samples: 11084800 | consumed tokens: 22701670400 | elapsed time per iteration (s): 4.18 | learning rate: 1.139E-04 | global batch size: 512 | lm loss: 2.008461E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.430 | TFLOPs: 57.06 | 7: iteration 21660/ 44073 | consumed samples: 11089920 | consumed tokens: 22712156160 | elapsed time per iteration (s): 4.13 | learning rate: 1.139E-04 | global batch size: 512 | lm loss: 2.018162E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.827 | TFLOPs: 57.71 | 7: iteration 21670/ 44073 | consumed samples: 11095040 | consumed tokens: 22722641920 | elapsed time per iteration (s): 4.31 | learning rate: 1.138E-04 | global batch size: 512 | lm loss: 2.000284E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.854 | TFLOPs: 55.39 | 7: iteration 21680/ 44073 | consumed samples: 11100160 | consumed tokens: 22733127680 | elapsed time per iteration (s): 4.14 | learning rate: 1.137E-04 | global batch size: 512 | lm loss: 2.019344E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.548 | TFLOPs: 57.58 | 7: iteration 21690/ 44073 | consumed samples: 11105280 | consumed tokens: 22743613440 | elapsed time per iteration (s): 4.19 | 
learning rate: 1.137E-04 | global batch size: 512 | lm loss: 2.024187E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.115 | TFLOPs: 56.91 | 7: iteration 21700/ 44073 | consumed samples: 11110400 | consumed tokens: 22754099200 | elapsed time per iteration (s): 4.19 | learning rate: 1.136E-04 | global batch size: 512 | lm loss: 2.028292E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.341 | TFLOPs: 57.02 | 7: iteration 21710/ 44073 | consumed samples: 11115520 | consumed tokens: 22764584960 | elapsed time per iteration (s): 4.19 | learning rate: 1.135E-04 | global batch size: 512 | lm loss: 2.021894E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.093 | TFLOPs: 56.90 | 7: iteration 21720/ 44073 | consumed samples: 11120640 | consumed tokens: 22775070720 | elapsed time per iteration (s): 4.16 | learning rate: 1.135E-04 | global batch size: 512 | lm loss: 1.999026E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.072 | TFLOPs: 57.36 | 7: iteration 21730/ 44073 | consumed samples: 11125760 | consumed tokens: 22785556480 | elapsed time per iteration (s): 4.14 | learning rate: 1.134E-04 | global batch size: 512 | lm loss: 2.023335E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 21740/ 44073 | consumed samples: 11130880 | consumed tokens: 22796042240 | elapsed time per iteration (s): 4.31 | learning rate: 1.134E-04 | global batch size: 512 | lm loss: 2.035692E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.735 | TFLOPs: 55.34 | 7: iteration 21750/ 44073 | 
consumed samples: 11136000 | consumed tokens: 22806528000 | elapsed time per iteration (s): 4.19 | learning rate: 1.133E-04 | global batch size: 512 | lm loss: 2.052800E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.084 | TFLOPs: 56.90 | 7: iteration 21760/ 44073 | consumed samples: 11141120 | consumed tokens: 22817013760 | elapsed time per iteration (s): 4.30 | learning rate: 1.132E-04 | global batch size: 512 | lm loss: 2.002716E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.976 | TFLOPs: 55.45 | 7: iteration 21770/ 44073 | consumed samples: 11146240 | consumed tokens: 22827499520 | elapsed time per iteration (s): 4.33 | learning rate: 1.132E-04 | global batch size: 512 | lm loss: 2.037241E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.254 | TFLOPs: 55.11 | 7: iteration 21780/ 44073 | consumed samples: 11151360 | consumed tokens: 22837985280 | elapsed time per iteration (s): 4.19 | learning rate: 1.131E-04 | global batch size: 512 | lm loss: 2.003278E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.216 | TFLOPs: 56.96 | 7: iteration 21790/ 44073 | consumed samples: 11156480 | consumed tokens: 22848471040 | elapsed time per iteration (s): 4.16 | learning rate: 1.130E-04 | global batch size: 512 | lm loss: 2.020346E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.166 | TFLOPs: 57.40 | 7: iteration 21800/ 44073 | consumed samples: 11161600 | consumed tokens: 22858956800 | elapsed time per iteration (s): 4.16 | learning rate: 1.130E-04 | global batch size: 512 | lm loss: 2.017544E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.149 | TFLOPs: 57.39 | 7: iteration 21810/ 44073 | consumed samples: 11166720 | consumed tokens: 22869442560 | elapsed time per iteration (s): 4.16 | learning rate: 1.129E-04 | global batch size: 512 | lm loss: 2.004850E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.094 | TFLOPs: 57.37 | 7: iteration 21820/ 44073 | consumed samples: 11171840 | consumed tokens: 22879928320 | elapsed time per iteration (s): 4.18 | learning rate: 1.128E-04 | global batch size: 512 | lm loss: 2.025101E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.470 | TFLOPs: 57.08 | 7: iteration 21830/ 44073 | consumed samples: 11176960 | consumed tokens: 22890414080 | elapsed time per iteration (s): 4.21 | learning rate: 1.128E-04 | global batch size: 512 | lm loss: 2.018611E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.714 | TFLOPs: 56.73 | 7: iteration 21840/ 44073 | consumed samples: 11182080 | consumed tokens: 22900899840 | elapsed time per iteration (s): 4.15 | learning rate: 1.127E-04 | global batch size: 512 | lm loss: 2.018278E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.307 | TFLOPs: 57.47 | 7: iteration 21850/ 44073 | consumed samples: 11187200 | consumed tokens: 22911385600 | elapsed time per iteration (s): 4.15 | learning rate: 1.126E-04 | global batch size: 512 | lm loss: 2.048876E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 21860/ 44073 | consumed samples: 11192320 | consumed tokens: 22921871360 | elapsed time per iteration (s): 4.16 | learning rate: 1.126E-04 | global batch size: 512 | lm loss: 
2.036713E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.938 | TFLOPs: 57.30 | 7: iteration 21870/ 44073 | consumed samples: 11197440 | consumed tokens: 22932357120 | elapsed time per iteration (s): 4.15 | learning rate: 1.125E-04 | global batch size: 512 | lm loss: 2.016937E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.364 | TFLOPs: 57.49 | 7: iteration 21880/ 44073 | consumed samples: 11202560 | consumed tokens: 22942842880 | elapsed time per iteration (s): 4.16 | learning rate: 1.124E-04 | global batch size: 512 | lm loss: 2.025172E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.009 | TFLOPs: 57.33 | 7: iteration 21890/ 44073 | consumed samples: 11207680 | consumed tokens: 22953328640 | elapsed time per iteration (s): 4.14 | learning rate: 1.124E-04 | global batch size: 512 | lm loss: 2.020633E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.787 | TFLOPs: 57.69 | 7: iteration 21900/ 44073 | consumed samples: 11212800 | consumed tokens: 22963814400 | elapsed time per iteration (s): 4.15 | learning rate: 1.123E-04 | global batch size: 512 | lm loss: 2.023770E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.505 | TFLOPs: 57.56 | 7: iteration 21910/ 44073 | consumed samples: 11217920 | consumed tokens: 22974300160 | elapsed time per iteration (s): 5.66 | learning rate: 1.122E-04 | global batch size: 512 | lm loss: 2.039141E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 90.526 | TFLOPs: 42.19 | 7: iteration 21920/ 44073 | consumed samples: 11223040 | consumed tokens: 22984785920 | 
elapsed time per iteration (s): 4.14 | learning rate: 1.122E-04 | global batch size: 512 | lm loss: 2.021843E+00 | grad norm: 0.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.661 | TFLOPs: 57.63 | 7: iteration 21930/ 44073 | consumed samples: 11228160 | consumed tokens: 22995271680 | elapsed time per iteration (s): 4.21 | learning rate: 1.121E-04 | global batch size: 512 | lm loss: 2.019159E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.515 | TFLOPs: 56.63 | 7: iteration 21940/ 44073 | consumed samples: 11233280 | consumed tokens: 23005757440 | elapsed time per iteration (s): 4.17 | learning rate: 1.121E-04 | global batch size: 512 | lm loss: 2.004750E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.775 | TFLOPs: 57.22 | 7: iteration 21950/ 44073 | consumed samples: 11238400 | consumed tokens: 23016243200 | elapsed time per iteration (s): 4.17 | learning rate: 1.120E-04 | global batch size: 512 | lm loss: 2.019926E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.641 | TFLOPs: 57.16 | 7: iteration 21960/ 44073 | consumed samples: 11243520 | consumed tokens: 23026728960 | elapsed time per iteration (s): 4.15 | learning rate: 1.119E-04 | global batch size: 512 | lm loss: 2.010007E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.281 | TFLOPs: 57.46 | 7: iteration 21970/ 44073 | consumed samples: 11248640 | consumed tokens: 23037214720 | elapsed time per iteration (s): 4.18 | learning rate: 1.119E-04 | global batch size: 512 | lm loss: 2.043565E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.473 | TFLOPs: 
57.08 |
7: iteration 21980/ 44073 | consumed samples: 11253760 | consumed tokens: 23047700480 | elapsed time per iteration (s): 4.15 | learning rate: 1.118E-04 | global batch size: 512 | lm loss: 2.035730E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.359 | TFLOPs: 57.49 |
7: iteration 21990/ 44073 | consumed samples: 11258880 | consumed tokens: 23058186240 | elapsed time per iteration (s): 4.14 | learning rate: 1.117E-04 | global batch size: 512 | lm loss: 2.014254E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.675 | TFLOPs: 57.64 |
0: [2022-11-26 12:16:23,429] [INFO] [logging.py:68:log_dist] [Rank 0] step=22000, skipped=0, lr=[0.00011166642979490337, 0.00011166642979490337, 0.00011166642979490337], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
7: iteration 22000/ 44073 | consumed samples: 11264000 | consumed tokens: 23068672000 | elapsed time per iteration (s): 4.15 | learning rate: 1.117E-04 | global batch size: 512 | lm loss: 2.004125E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.347 | TFLOPs: 57.49 |
0: steps: 22000 loss: 1.9484 iter time (s): 4.452 samples/sec: 115.013
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 22000 | lm loss value: 1.983518E+00 | lm loss PPL: 7.268266E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 22000 to checkpoints_2b2
0: [2022-11-26 12:16:24,763] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step22000 is begin to save!
0: [2022-11-26 12:16:24,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_01-model_00-model_states.pt...
0: [2022-11-26 12:16:25,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:16:25,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:16:25,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:16:25,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:16:25,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:16:25,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:16:25,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:16:25,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:16:25,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:16:25,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:16:25,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:16:25,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 12:16:25,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:16:25,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:16:26,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:16:26,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:16:26,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:16:26,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:16:26,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:16:26,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:16:26,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:16:26,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:16:26,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:16:26,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 12:16:26,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:16:26,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:16:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:16:26,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:16:27,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:16:27,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:16:27,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:16:27,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:16:27,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:16:27,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:16:27,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:16:27,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 12:16:27,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:16:27,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:16:27,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:16:27,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:16:27,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:16:27,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:16:28,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:16:28,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:16:28,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:16:28,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:16:28,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:16:28,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 12:16:28,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:16:28,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:16:28,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:16:28,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:16:28,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:16:28,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_29-model_00-model_states.pt... 0: [2022-11-26 12:16:28,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_29-model_00-model_states.pt. 0: [2022-11-26 12:16:28,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:16:28,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:16:28,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_31-model_00-model_states.pt... 0: [2022-11-26 12:16:29,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_31-model_00-model_states.pt. 0: [2022-11-26 12:16:29,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 12:16:29,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_32-model_00-model_states.pt. 0: [2022-11-26 12:16:29,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_33-model_00-model_states.pt... 0: [2022-11-26 12:16:29,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_33-model_00-model_states.pt. 0: [2022-11-26 12:16:29,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_34-model_00-model_states.pt... 0: [2022-11-26 12:16:29,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_34-model_00-model_states.pt. 0: [2022-11-26 12:16:29,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/layer_36-model_00-model_states.pt... 0: [2022-11-26 12:16:29,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/layer_36-model_00-model_states.pt. 0: [2022-11-26 12:16:29,518] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step22000/mp_rank_00_model_states.pt 0: [2022-11-26 12:16:29,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:16:29,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:16:29,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:16:30,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:16:30,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 12:16:30,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
5: [2022-11-26 12:16:30,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
2: [2022-11-26 12:16:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
5: [2022-11-26 12:16:30,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 12:16:30,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:16:30,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
5: [2022-11-26 12:16:30,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 12:16:30,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
7: [2022-11-26 12:16:30,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 12:16:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:16:30,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:16:30,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 12:16:30,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
4: [2022-11-26 12:16:30,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 12:16:30,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:16:30,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:16:30,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 12:16:30,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:16:30,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:16:30,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
3: [2022-11-26 12:16:30,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 12:16:30,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:16:30,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:16:30,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
4: [2022-11-26 12:16:30,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:16:30,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:16:30,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 12:16:30,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 12:16:30,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 12:16:30,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:16:30,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:16:30,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:16:30,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 12:16:30,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
0: successfully saved checkpoint at iteration 22000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5980.46 7: iteration 22010/ 44073 | consumed samples: 11269120 | consumed tokens: 23079157760 | elapsed time per iteration (s): 4.88 | learning rate: 1.116E-04 | global batch size: 512 | lm loss: 2.011443E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.868 | TFLOPs: 48.87 | 7: iteration 22020/ 44073 | consumed samples: 11274240 | consumed tokens: 23089643520 | elapsed time per iteration (s): 4.14 | learning rate: 1.115E-04 | global batch size: 512 | lm loss: 2.028509E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.582 | TFLOPs: 57.60 | 7: iteration 22030/ 44073 | consumed samples: 11279360 | consumed tokens: 23100129280 | elapsed time per iteration (s): 4.16 | learning rate: 1.115E-04 | global batch size: 512 | lm loss: 2.012884E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.207 | TFLOPs: 57.42 | 7: iteration 22040/ 44073 | consumed samples: 11284480 | consumed tokens: 23110615040 | elapsed time per iteration (s): 4.15 | learning rate: 1.114E-04 | global batch size: 512 | lm loss: 2.021301E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.476 | TFLOPs: 57.55 | 7: iteration 22050/ 44073 | consumed samples: 11289600 | consumed tokens: 23121100800 | elapsed time per iteration (s): 4.14 | learning rate: 1.113E-04 | global batch size: 512 | lm loss: 2.024092E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.553 | TFLOPs: 57.58 | 7: iteration 22060/ 44073 | consumed samples: 11294720 | consumed tokens: 23131586560 | elapsed time per iteration (s): 4.14 | learning rate: 
1.113E-04 | global batch size: 512 | lm loss: 2.029943E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.685 | TFLOPs: 57.64 | 7: iteration 22070/ 44073 | consumed samples: 11299840 | consumed tokens: 23142072320 | elapsed time per iteration (s): 4.16 | learning rate: 1.112E-04 | global batch size: 512 | lm loss: 2.006829E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.072 | TFLOPs: 57.36 | 7: iteration 22080/ 44073 | consumed samples: 11304960 | consumed tokens: 23152558080 | elapsed time per iteration (s): 4.19 | learning rate: 1.111E-04 | global batch size: 512 | lm loss: 2.031873E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.058 | TFLOPs: 56.89 | 7: iteration 22090/ 44073 | consumed samples: 11310080 | consumed tokens: 23163043840 | elapsed time per iteration (s): 4.14 | learning rate: 1.111E-04 | global batch size: 512 | lm loss: 1.997287E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.608 | TFLOPs: 57.61 | 7: iteration 22100/ 44073 | consumed samples: 11315200 | consumed tokens: 23173529600 | elapsed time per iteration (s): 4.14 | learning rate: 1.110E-04 | global batch size: 512 | lm loss: 2.003308E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.659 | TFLOPs: 57.63 | 7: iteration 22110/ 44073 | consumed samples: 11320320 | consumed tokens: 23184015360 | elapsed time per iteration (s): 4.14 | learning rate: 1.110E-04 | global batch size: 512 | lm loss: 2.020484E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.763 | TFLOPs: 57.68 | 7: iteration 22120/ 44073 | consumed samples: 
11325440 | consumed tokens: 23194501120 | elapsed time per iteration (s): 4.13 | learning rate: 1.109E-04 | global batch size: 512 | lm loss: 2.015197E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.887 | TFLOPs: 57.74 | 7: iteration 22130/ 44073 | consumed samples: 11330560 | consumed tokens: 23204986880 | elapsed time per iteration (s): 4.13 | learning rate: 1.108E-04 | global batch size: 512 | lm loss: 2.020516E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.879 | TFLOPs: 57.73 | 7: iteration 22140/ 44073 | consumed samples: 11335680 | consumed tokens: 23215472640 | elapsed time per iteration (s): 4.17 | learning rate: 1.108E-04 | global batch size: 512 | lm loss: 2.017019E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.920 | TFLOPs: 57.29 | 7: iteration 22150/ 44073 | consumed samples: 11340800 | consumed tokens: 23225958400 | elapsed time per iteration (s): 4.15 | learning rate: 1.107E-04 | global batch size: 512 | lm loss: 2.004035E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.407 | TFLOPs: 57.51 | 7: iteration 22160/ 44073 | consumed samples: 11345920 | consumed tokens: 23236444160 | elapsed time per iteration (s): 4.14 | learning rate: 1.106E-04 | global batch size: 512 | lm loss: 2.015079E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.684 | TFLOPs: 57.64 | 7: iteration 22170/ 44073 | consumed samples: 11351040 | consumed tokens: 23246929920 | elapsed time per iteration (s): 4.17 | learning rate: 1.106E-04 | global batch size: 512 | lm loss: 2.008994E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 122.711 | TFLOPs: 57.19 | 7: iteration 22180/ 44073 | consumed samples: 11356160 | consumed tokens: 23257415680 | elapsed time per iteration (s): 4.13 | learning rate: 1.105E-04 | global batch size: 512 | lm loss: 2.016941E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.825 | TFLOPs: 57.71 | 7: iteration 22190/ 44073 | consumed samples: 11361280 | consumed tokens: 23267901440 | elapsed time per iteration (s): 4.14 | learning rate: 1.104E-04 | global batch size: 512 | lm loss: 2.022685E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.759 | TFLOPs: 57.68 | 7: iteration 22200/ 44073 | consumed samples: 11366400 | consumed tokens: 23278387200 | elapsed time per iteration (s): 4.15 | learning rate: 1.104E-04 | global batch size: 512 | lm loss: 2.018837E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.296 | TFLOPs: 57.46 | 7: iteration 22210/ 44073 | consumed samples: 11371520 | consumed tokens: 23288872960 | elapsed time per iteration (s): 4.17 | learning rate: 1.103E-04 | global batch size: 512 | lm loss: 2.024447E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.694 | TFLOPs: 57.18 | 7: iteration 22220/ 44073 | consumed samples: 11376640 | consumed tokens: 23299358720 | elapsed time per iteration (s): 4.14 | learning rate: 1.102E-04 | global batch size: 512 | lm loss: 1.999099E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.680 | TFLOPs: 57.64 | 7: iteration 22230/ 44073 | consumed samples: 11381760 | consumed tokens: 23309844480 | elapsed time per iteration (s): 4.19 | learning rate: 1.102E-04 | global batch size: 512 | lm loss: 2.016943E+00 | 
grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.212 | TFLOPs: 56.96 | 7: iteration 22240/ 44073 | consumed samples: 11386880 | consumed tokens: 23320330240 | elapsed time per iteration (s): 4.16 | learning rate: 1.101E-04 | global batch size: 512 | lm loss: 2.019688E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.184 | TFLOPs: 57.41 | 7: iteration 22250/ 44073 | consumed samples: 11392000 | consumed tokens: 23330816000 | elapsed time per iteration (s): 4.13 | learning rate: 1.100E-04 | global batch size: 512 | lm loss: 2.001457E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.851 | TFLOPs: 57.72 | 7: iteration 22260/ 44073 | consumed samples: 11397120 | consumed tokens: 23341301760 | elapsed time per iteration (s): 4.16 | learning rate: 1.100E-04 | global batch size: 512 | lm loss: 2.023615E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.993 | TFLOPs: 57.32 | 7: iteration 22270/ 44073 | consumed samples: 11402240 | consumed tokens: 23351787520 | elapsed time per iteration (s): 4.17 | learning rate: 1.099E-04 | global batch size: 512 | lm loss: 2.010758E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.861 | TFLOPs: 57.26 | 7: iteration 22280/ 44073 | consumed samples: 11407360 | consumed tokens: 23362273280 | elapsed time per iteration (s): 4.13 | learning rate: 1.099E-04 | global batch size: 512 | lm loss: 2.001471E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.893 | TFLOPs: 57.74 | 7: iteration 22290/ 44073 | consumed samples: 11412480 | consumed tokens: 23372759040 | elapsed time per 
iteration (s): 4.15 | learning rate: 1.098E-04 | global batch size: 512 | lm loss: 2.020566E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.487 | TFLOPs: 57.55 | 7: iteration 22300/ 44073 | consumed samples: 11417600 | consumed tokens: 23383244800 | elapsed time per iteration (s): 4.14 | learning rate: 1.097E-04 | global batch size: 512 | lm loss: 2.000721E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.619 | TFLOPs: 57.61 | 7: iteration 22310/ 44073 | consumed samples: 11422720 | consumed tokens: 23393730560 | elapsed time per iteration (s): 4.15 | learning rate: 1.097E-04 | global batch size: 512 | lm loss: 2.011820E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 22320/ 44073 | consumed samples: 11427840 | consumed tokens: 23404216320 | elapsed time per iteration (s): 4.14 | learning rate: 1.096E-04 | global batch size: 512 | lm loss: 2.017674E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.738 | TFLOPs: 57.67 | 7: iteration 22330/ 44073 | consumed samples: 11432960 | consumed tokens: 23414702080 | elapsed time per iteration (s): 4.14 | learning rate: 1.095E-04 | global batch size: 512 | lm loss: 1.995812E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.614 | TFLOPs: 57.61 | 7: iteration 22340/ 44073 | consumed samples: 11438080 | consumed tokens: 23425187840 | elapsed time per iteration (s): 4.17 | learning rate: 1.095E-04 | global batch size: 512 | lm loss: 2.003435E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.819 | TFLOPs: 57.24 | 7: 
iteration 22350/ 44073 | consumed samples: 11443200 | consumed tokens: 23435673600 | elapsed time per iteration (s): 4.20 | learning rate: 1.094E-04 | global batch size: 512 | lm loss: 2.021084E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.840 | TFLOPs: 56.78 | 7: iteration 22360/ 44073 | consumed samples: 11448320 | consumed tokens: 23446159360 | elapsed time per iteration (s): 4.16 | learning rate: 1.093E-04 | global batch size: 512 | lm loss: 2.022436E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.040 | TFLOPs: 57.34 | 7: iteration 22370/ 44073 | consumed samples: 11453440 | consumed tokens: 23456645120 | elapsed time per iteration (s): 4.16 | learning rate: 1.093E-04 | global batch size: 512 | lm loss: 2.014914E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.198 | TFLOPs: 57.42 | 7: iteration 22380/ 44073 | consumed samples: 11458560 | consumed tokens: 23467130880 | elapsed time per iteration (s): 4.17 | learning rate: 1.092E-04 | global batch size: 512 | lm loss: 2.017152E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.853 | TFLOPs: 57.26 | 7: iteration 22390/ 44073 | consumed samples: 11463680 | consumed tokens: 23477616640 | elapsed time per iteration (s): 4.16 | learning rate: 1.091E-04 | global batch size: 512 | lm loss: 2.022523E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.929 | TFLOPs: 57.29 | 7: iteration 22400/ 44073 | consumed samples: 11468800 | consumed tokens: 23488102400 | elapsed time per iteration (s): 4.14 | learning rate: 1.091E-04 | global batch size: 512 | lm loss: 2.021502E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.682 | TFLOPs: 57.64 | 7: iteration 22410/ 44073 | consumed samples: 11473920 | consumed tokens: 23498588160 | elapsed time per iteration (s): 4.21 | learning rate: 1.090E-04 | global batch size: 512 | lm loss: 2.012400E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.724 | TFLOPs: 56.73 | 7: iteration 22420/ 44073 | consumed samples: 11479040 | consumed tokens: 23509073920 | elapsed time per iteration (s): 4.15 | learning rate: 1.089E-04 | global batch size: 512 | lm loss: 2.008639E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 22430/ 44073 | consumed samples: 11484160 | consumed tokens: 23519559680 | elapsed time per iteration (s): 4.16 | learning rate: 1.089E-04 | global batch size: 512 | lm loss: 2.006824E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.996 | TFLOPs: 57.32 | 7: iteration 22440/ 44073 | consumed samples: 11489280 | consumed tokens: 23530045440 | elapsed time per iteration (s): 4.16 | learning rate: 1.088E-04 | global batch size: 512 | lm loss: 2.034348E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.102 | TFLOPs: 57.37 | 7: iteration 22450/ 44073 | consumed samples: 11494400 | consumed tokens: 23540531200 | elapsed time per iteration (s): 4.17 | learning rate: 1.088E-04 | global batch size: 512 | lm loss: 2.024005E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.819 | TFLOPs: 57.24 | 7: iteration 22460/ 44073 | consumed samples: 11499520 | consumed tokens: 23551016960 | elapsed time per iteration (s): 4.19 | learning rate: 1.087E-04 | global 
batch size: 512 | lm loss: 2.023272E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.309 | TFLOPs: 57.00 | 7: iteration 22470/ 44073 | consumed samples: 11504640 | consumed tokens: 23561502720 | elapsed time per iteration (s): 4.15 | learning rate: 1.086E-04 | global batch size: 512 | lm loss: 2.020023E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.343 | TFLOPs: 57.48 | 7: iteration 22480/ 44073 | consumed samples: 11509760 | consumed tokens: 23571988480 | elapsed time per iteration (s): 4.15 | learning rate: 1.086E-04 | global batch size: 512 | lm loss: 2.016833E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.418 | TFLOPs: 57.52 | 7: iteration 22490/ 44073 | consumed samples: 11514880 | consumed tokens: 23582474240 | elapsed time per iteration (s): 4.17 | learning rate: 1.085E-04 | global batch size: 512 | lm loss: 2.004689E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.821 | TFLOPs: 57.24 | 7: iteration 22500/ 44073 | consumed samples: 11520000 | consumed tokens: 23592960000 | elapsed time per iteration (s): 4.15 | learning rate: 1.084E-04 | global batch size: 512 | lm loss: 2.002239E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 22510/ 44073 | consumed samples: 11525120 | consumed tokens: 23603445760 | elapsed time per iteration (s): 4.17 | learning rate: 1.084E-04 | global batch size: 512 | lm loss: 2.008435E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.822 | TFLOPs: 57.24 | 7: iteration 22520/ 44073 | consumed samples: 11530240 | consumed 
tokens: 23613931520 | elapsed time per iteration (s): 4.18 | learning rate: 1.083E-04 | global batch size: 512 | lm loss: 2.001750E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.486 | TFLOPs: 57.08 | 7: iteration 22530/ 44073 | consumed samples: 11535360 | consumed tokens: 23624417280 | elapsed time per iteration (s): 4.16 | learning rate: 1.082E-04 | global batch size: 512 | lm loss: 2.020505E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.997 | TFLOPs: 57.32 | 7: iteration 22540/ 44073 | consumed samples: 11540480 | consumed tokens: 23634903040 | elapsed time per iteration (s): 4.17 | learning rate: 1.082E-04 | global batch size: 512 | lm loss: 2.003156E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.884 | TFLOPs: 57.27 | 7: iteration 22550/ 44073 | consumed samples: 11545600 | consumed tokens: 23645388800 | elapsed time per iteration (s): 4.20 | learning rate: 1.081E-04 | global batch size: 512 | lm loss: 2.012672E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.817 | TFLOPs: 56.77 | 7: iteration 22560/ 44073 | consumed samples: 11550720 | consumed tokens: 23655874560 | elapsed time per iteration (s): 4.17 | learning rate: 1.080E-04 | global batch size: 512 | lm loss: 2.026363E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.671 | TFLOPs: 57.17 | 7: iteration 22570/ 44073 | consumed samples: 11555840 | consumed tokens: 23666360320 | elapsed time per iteration (s): 4.15 | learning rate: 1.080E-04 | global batch size: 512 | lm loss: 2.021286E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.348 | TFLOPs: 57.49 | 7: iteration 22580/ 44073 | consumed samples: 11560960 | consumed tokens: 23676846080 | elapsed time per iteration (s): 4.15 | learning rate: 1.079E-04 | global batch size: 512 | lm loss: 2.017044E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.497 | TFLOPs: 57.56 | 7: iteration 22590/ 44073 | consumed samples: 11566080 | consumed tokens: 23687331840 | elapsed time per iteration (s): 4.14 | learning rate: 1.078E-04 | global batch size: 512 | lm loss: 2.015938E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.697 | TFLOPs: 57.65 | 7: iteration 22600/ 44073 | consumed samples: 11571200 | consumed tokens: 23697817600 | elapsed time per iteration (s): 4.15 | learning rate: 1.078E-04 | global batch size: 512 | lm loss: 2.003674E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 22610/ 44073 | consumed samples: 11576320 | consumed tokens: 23708303360 | elapsed time per iteration (s): 4.16 | learning rate: 1.077E-04 | global batch size: 512 | lm loss: 2.009325E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.174 | TFLOPs: 57.41 | 7: iteration 22620/ 44073 | consumed samples: 11581440 | consumed tokens: 23718789120 | elapsed time per iteration (s): 4.18 | learning rate: 1.076E-04 | global batch size: 512 | lm loss: 2.008966E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.580 | TFLOPs: 57.13 | 7: iteration 22630/ 44073 | consumed samples: 11586560 | consumed tokens: 23729274880 | elapsed time per iteration (s): 4.17 | learning rate: 1.076E-04 | global batch size: 512 | lm loss: 2.004242E+00 | grad norm: 0.119 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.823 | TFLOPs: 57.24 | 7: iteration 22640/ 44073 | consumed samples: 11591680 | consumed tokens: 23739760640 | elapsed time per iteration (s): 4.19 | learning rate: 1.075E-04 | global batch size: 512 | lm loss: 2.014823E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.155 | TFLOPs: 56.93 | 7: iteration 22650/ 44073 | consumed samples: 11596800 | consumed tokens: 23750246400 | elapsed time per iteration (s): 4.19 | learning rate: 1.075E-04 | global batch size: 512 | lm loss: 2.027652E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.162 | TFLOPs: 56.93 | 7: iteration 22660/ 44073 | consumed samples: 11601920 | consumed tokens: 23760732160 | elapsed time per iteration (s): 4.13 | learning rate: 1.074E-04 | global batch size: 512 | lm loss: 1.989577E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.847 | TFLOPs: 57.72 | 7: iteration 22670/ 44073 | consumed samples: 11607040 | consumed tokens: 23771217920 | elapsed time per iteration (s): 4.20 | learning rate: 1.073E-04 | global batch size: 512 | lm loss: 2.010593E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.983 | TFLOPs: 56.85 | 7: iteration 22680/ 44073 | consumed samples: 11612160 | consumed tokens: 23781703680 | elapsed time per iteration (s): 4.19 | learning rate: 1.073E-04 | global batch size: 512 | lm loss: 2.003947E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.336 | TFLOPs: 57.01 | 7: iteration 22690/ 44073 | consumed samples: 11617280 | consumed tokens: 23792189440 | elapsed time per iteration (s): 4.31 
| learning rate: 1.072E-04 | global batch size: 512 | lm loss: 2.013085E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.772 | TFLOPs: 55.35 | 7: iteration 22700/ 44073 | consumed samples: 11622400 | consumed tokens: 23802675200 | elapsed time per iteration (s): 4.14 | learning rate: 1.071E-04 | global batch size: 512 | lm loss: 2.032783E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.741 | TFLOPs: 57.67 | 7: iteration 22710/ 44073 | consumed samples: 11627520 | consumed tokens: 23813160960 | elapsed time per iteration (s): 4.16 | learning rate: 1.071E-04 | global batch size: 512 | lm loss: 2.006086E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.983 | TFLOPs: 57.32 | 7: iteration 22720/ 44073 | consumed samples: 11632640 | consumed tokens: 23823646720 | elapsed time per iteration (s): 4.16 | learning rate: 1.070E-04 | global batch size: 512 | lm loss: 2.002592E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.147 | TFLOPs: 57.39 | 7: iteration 22730/ 44073 | consumed samples: 11637760 | consumed tokens: 23834132480 | elapsed time per iteration (s): 4.18 | learning rate: 1.069E-04 | global batch size: 512 | lm loss: 2.026056E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.605 | TFLOPs: 57.14 | 7: iteration 22740/ 44073 | consumed samples: 11642880 | consumed tokens: 23844618240 | elapsed time per iteration (s): 4.17 | learning rate: 1.069E-04 | global batch size: 512 | lm loss: 2.020663E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.879 | TFLOPs: 57.27 | 7: iteration 22750/ 44073 | 
consumed samples: 11648000 | consumed tokens: 23855104000 | elapsed time per iteration (s): 4.17 | learning rate: 1.068E-04 | global batch size: 512 | lm loss: 2.011404E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.897 | TFLOPs: 57.28 | 7: iteration 22760/ 44073 | consumed samples: 11653120 | consumed tokens: 23865589760 | elapsed time per iteration (s): 4.33 | learning rate: 1.067E-04 | global batch size: 512 | lm loss: 2.018812E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.351 | TFLOPs: 55.16 | 7: iteration 22770/ 44073 | consumed samples: 11658240 | consumed tokens: 23876075520 | elapsed time per iteration (s): 4.16 | learning rate: 1.067E-04 | global batch size: 512 | lm loss: 1.995820E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.201 | TFLOPs: 57.42 | 7: iteration 22780/ 44073 | consumed samples: 11663360 | consumed tokens: 23886561280 | elapsed time per iteration (s): 4.19 | learning rate: 1.066E-04 | global batch size: 512 | lm loss: 2.021244E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.147 | TFLOPs: 56.93 | 7: iteration 22790/ 44073 | consumed samples: 11668480 | consumed tokens: 23897047040 | elapsed time per iteration (s): 4.16 | learning rate: 1.065E-04 | global batch size: 512 | lm loss: 2.019924E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.074 | TFLOPs: 57.36 | 7: iteration 22800/ 44073 | consumed samples: 11673600 | consumed tokens: 23907532800 | elapsed time per iteration (s): 4.14 | learning rate: 1.065E-04 | global batch size: 512 | lm loss: 2.010024E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.551 | TFLOPs: 57.58 |
7: iteration 22810/ 44073 | consumed samples: 11678720 | consumed tokens: 23918018560 | elapsed time per iteration (s): 4.14 | learning rate: 1.064E-04 | global batch size: 512 | lm loss: 1.994013E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.713 | TFLOPs: 57.66 |
7: iteration 22820/ 44073 | consumed samples: 11683840 | consumed tokens: 23928504320 | elapsed time per iteration (s): 4.17 | learning rate: 1.064E-04 | global batch size: 512 | lm loss: 2.019008E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.849 | TFLOPs: 57.25 |
7: iteration 22830/ 44073 | consumed samples: 11688960 | consumed tokens: 23938990080 | elapsed time per iteration (s): 4.15 | learning rate: 1.063E-04 | global batch size: 512 | lm loss: 2.012361E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.517 | TFLOPs: 57.57 |
7: iteration 22840/ 44073 | consumed samples: 11694080 | consumed tokens: 23949475840 | elapsed time per iteration (s): 4.17 | learning rate: 1.062E-04 | global batch size: 512 | lm loss: 1.997840E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.686 | TFLOPs: 57.18 |
7: iteration 22850/ 44073 | consumed samples: 11699200 | consumed tokens: 23959961600 | elapsed time per iteration (s): 4.15 | learning rate: 1.062E-04 | global batch size: 512 | lm loss: 2.022999E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.460 | TFLOPs: 57.54 |
7: iteration 22860/ 44073 | consumed samples: 11704320 | consumed tokens: 23970447360 | elapsed time per iteration (s): 4.17 | learning rate: 1.061E-04 | global batch size: 512 | lm loss: 1.986295E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.916 | TFLOPs: 57.29 |
7: iteration 22870/ 44073 | consumed samples: 11709440 | consumed tokens: 23980933120 | elapsed time per iteration (s): 4.17 | learning rate: 1.060E-04 | global batch size: 512 | lm loss: 1.990937E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.722 | TFLOPs: 57.19 |
7: iteration 22880/ 44073 | consumed samples: 11714560 | consumed tokens: 23991418880 | elapsed time per iteration (s): 4.15 | learning rate: 1.060E-04 | global batch size: 512 | lm loss: 2.014546E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.370 | TFLOPs: 57.50 |
7: iteration 22890/ 44073 | consumed samples: 11719680 | consumed tokens: 24001904640 | elapsed time per iteration (s): 4.16 | learning rate: 1.059E-04 | global batch size: 512 | lm loss: 2.017120E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.038 | TFLOPs: 57.34 |
7: iteration 22900/ 44073 | consumed samples: 11724800 | consumed tokens: 24012390400 | elapsed time per iteration (s): 4.16 | learning rate: 1.058E-04 | global batch size: 512 | lm loss: 2.031536E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.149 | TFLOPs: 57.39 |
7: iteration 22910/ 44073 | consumed samples: 11729920 | consumed tokens: 24022876160 | elapsed time per iteration (s): 4.17 | learning rate: 1.058E-04 | global batch size: 512 | lm loss: 2.005829E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.701 | TFLOPs: 57.18 |
7: iteration 22920/ 44073 | consumed samples: 11735040 | consumed tokens: 24033361920 | elapsed time per iteration (s): 4.18 | learning rate: 1.057E-04 | global batch size: 512 | lm loss: 2.011987E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.635 | TFLOPs: 57.15 |
7: iteration 22930/ 44073 | consumed samples: 11740160 | consumed tokens: 24043847680 | elapsed time per iteration (s): 4.15 | learning rate: 1.056E-04 | global batch size: 512 | lm loss: 2.006442E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.374 | TFLOPs: 57.50 |
7: iteration 22940/ 44073 | consumed samples: 11745280 | consumed tokens: 24054333440 | elapsed time per iteration (s): 4.19 | learning rate: 1.056E-04 | global batch size: 512 | lm loss: 2.015259E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.242 | TFLOPs: 56.97 |
7: iteration 22950/ 44073 | consumed samples: 11750400 | consumed tokens: 24064819200 | elapsed time per iteration (s): 4.22 | learning rate: 1.055E-04 | global batch size: 512 | lm loss: 2.034294E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.455 | TFLOPs: 56.60 |
7: iteration 22960/ 44073 | consumed samples: 11755520 | consumed tokens: 24075304960 | elapsed time per iteration (s): 4.14 | learning rate: 1.054E-04 | global batch size: 512 | lm loss: 2.018471E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.527 | TFLOPs: 57.57 |
7: iteration 22970/ 44073 | consumed samples: 11760640 | consumed tokens: 24085790720 | elapsed time per iteration (s): 4.14 | learning rate: 1.054E-04 | global batch size: 512 | lm loss: 2.011012E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.621 | TFLOPs: 57.61 |
7: iteration 22980/ 44073 | consumed samples: 11765760 | consumed tokens: 24096276480 | elapsed time per iteration (s): 4.15 | learning rate: 1.053E-04 | global batch size: 512 | lm loss: 2.023096E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.428 | TFLOPs: 57.52 |
7: iteration 22990/ 44073 | consumed samples: 11770880 | consumed tokens: 24106762240 | elapsed time per iteration (s): 4.14 | learning rate: 1.053E-04 | global batch size: 512 | lm loss: 2.004017E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.758 | TFLOPs: 57.68 |
7: iteration 23000/ 44073 | consumed samples: 11776000 | consumed tokens: 24117248000 | elapsed time per iteration (s): 4.14 | learning rate: 1.052E-04 | global batch size: 512 | lm loss: 2.022625E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.567 | TFLOPs: 57.59 |
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 23000 | lm loss value: 1.994643E+00 | lm loss PPL: 7.349580E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 23000 to checkpoints_2b2
0: [2022-11-26 13:25:54,627] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step23000 is begin to save!
0: [2022-11-26 13:25:54,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_01-model_00-model_states.pt...
0: [2022-11-26 13:25:54,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_01-model_00-model_states.pt.
0: [2022-11-26 13:25:54,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_03-model_00-model_states.pt...
0: [2022-11-26 13:25:55,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_03-model_00-model_states.pt.
0: [2022-11-26 13:25:55,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_04-model_00-model_states.pt...
0: [2022-11-26 13:25:55,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_04-model_00-model_states.pt.
0: [2022-11-26 13:25:55,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_05-model_00-model_states.pt...
0: [2022-11-26 13:25:55,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_05-model_00-model_states.pt.
0: [2022-11-26 13:25:55,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_06-model_00-model_states.pt...
0: [2022-11-26 13:25:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_06-model_00-model_states.pt.
0: [2022-11-26 13:25:55,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_07-model_00-model_states.pt...
0: [2022-11-26 13:25:55,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_07-model_00-model_states.pt.
0: [2022-11-26 13:25:55,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_08-model_00-model_states.pt...
0: [2022-11-26 13:25:55,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_08-model_00-model_states.pt.
0: [2022-11-26 13:25:55,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_09-model_00-model_states.pt...
0: [2022-11-26 13:25:55,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_09-model_00-model_states.pt.
0: [2022-11-26 13:25:55,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_10-model_00-model_states.pt...
0: [2022-11-26 13:25:56,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_10-model_00-model_states.pt.
0: [2022-11-26 13:25:56,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_11-model_00-model_states.pt...
0: [2022-11-26 13:25:56,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_11-model_00-model_states.pt.
0: [2022-11-26 13:25:56,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_12-model_00-model_states.pt...
0: [2022-11-26 13:25:56,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_12-model_00-model_states.pt.
0: [2022-11-26 13:25:56,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_13-model_00-model_states.pt...
0: [2022-11-26 13:25:56,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_13-model_00-model_states.pt.
0: [2022-11-26 13:25:56,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_14-model_00-model_states.pt...
0: [2022-11-26 13:25:56,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_14-model_00-model_states.pt.
0: [2022-11-26 13:25:56,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_15-model_00-model_states.pt...
0: [2022-11-26 13:25:56,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_15-model_00-model_states.pt.
0: [2022-11-26 13:25:56,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_16-model_00-model_states.pt...
0: [2022-11-26 13:25:56,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_16-model_00-model_states.pt.
0: [2022-11-26 13:25:56,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_17-model_00-model_states.pt...
0: [2022-11-26 13:25:56,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_17-model_00-model_states.pt.
0: [2022-11-26 13:25:56,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_18-model_00-model_states.pt...
0: [2022-11-26 13:25:57,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_18-model_00-model_states.pt.
0: [2022-11-26 13:25:57,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_19-model_00-model_states.pt...
0: [2022-11-26 13:25:57,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_19-model_00-model_states.pt.
0: [2022-11-26 13:25:57,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_20-model_00-model_states.pt...
0: [2022-11-26 13:25:57,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_20-model_00-model_states.pt.
0: [2022-11-26 13:25:57,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_21-model_00-model_states.pt...
0: [2022-11-26 13:25:57,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_21-model_00-model_states.pt.
0: [2022-11-26 13:25:57,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_22-model_00-model_states.pt...
0: [2022-11-26 13:25:57,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_22-model_00-model_states.pt.
0: [2022-11-26 13:25:57,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_23-model_00-model_states.pt...
0: [2022-11-26 13:25:57,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_23-model_00-model_states.pt.
0: [2022-11-26 13:25:57,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_24-model_00-model_states.pt...
0: [2022-11-26 13:25:57,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_24-model_00-model_states.pt.
0: [2022-11-26 13:25:57,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_25-model_00-model_states.pt...
0: [2022-11-26 13:25:57,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_25-model_00-model_states.pt.
0: [2022-11-26 13:25:57,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_26-model_00-model_states.pt...
0: [2022-11-26 13:25:58,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_26-model_00-model_states.pt.
0: [2022-11-26 13:25:58,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_27-model_00-model_states.pt...
0: [2022-11-26 13:25:58,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_27-model_00-model_states.pt.
0: [2022-11-26 13:25:58,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_28-model_00-model_states.pt...
0: [2022-11-26 13:25:58,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_28-model_00-model_states.pt.
0: [2022-11-26 13:25:58,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_29-model_00-model_states.pt...
0: [2022-11-26 13:25:58,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_29-model_00-model_states.pt.
0: [2022-11-26 13:25:58,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_30-model_00-model_states.pt...
0: [2022-11-26 13:25:58,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_30-model_00-model_states.pt.
0: [2022-11-26 13:25:58,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_31-model_00-model_states.pt...
0: [2022-11-26 13:25:58,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_31-model_00-model_states.pt.
0: [2022-11-26 13:25:58,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_32-model_00-model_states.pt...
0: [2022-11-26 13:25:58,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_32-model_00-model_states.pt.
0: [2022-11-26 13:25:58,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_33-model_00-model_states.pt...
0: [2022-11-26 13:25:58,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_33-model_00-model_states.pt.
0: [2022-11-26 13:25:58,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_34-model_00-model_states.pt...
0: [2022-11-26 13:25:58,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_34-model_00-model_states.pt.
0: [2022-11-26 13:25:58,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/layer_36-model_00-model_states.pt...
0: [2022-11-26 13:25:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/layer_36-model_00-model_states.pt.
0: [2022-11-26 13:25:58,995] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step23000/mp_rank_00_model_states.pt
0: [2022-11-26 13:25:58,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/mp_rank_00_model_states.pt...
0: [2022-11-26 13:25:59,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/mp_rank_00_model_states.pt.
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
2: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt...
6: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt...
7: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt...
4: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
1: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt...
5: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt...
3: [2022-11-26 13:25:59,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt...
0: [2022-11-26 13:25:59,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
0: [2022-11-26 13:25:59,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt.
0: [2022-11-26 13:25:59,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
0: [2022-11-26 13:25:59,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt
5: [2022-11-26 13:25:59,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now!
5: [2022-11-26 13:25:59,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:25:59,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:25:59,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 13:25:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:25:59,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:25:59,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:25:59,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:25:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 13:25:59,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:25:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:25:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:25:59,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
3: [2022-11-26 13:25:59,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 13:25:59,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:25:59,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:25:59,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:25:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:25:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:25:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:25:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:25:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:26:00,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:26:00,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:26:00,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:26:00,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:26:00,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:26:00,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:26:00,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:26:00,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:26:00,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 13:26:00,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:26:00,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:26:00,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
6: [2022-11-26 13:26:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 13:26:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:26:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:26:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 13:26:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
0: [2022-11-26 13:26:00,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:26:00,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 13:26:00,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:26:00,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:26:00,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: successfully saved checkpoint at iteration 23000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5764.84 7: iteration 23010/ 44073 | consumed samples: 11781120 | consumed tokens: 24127733760 | elapsed time per iteration (s): 4.86 | learning rate: 1.051E-04 | global batch size: 512 | lm loss: 1.998550E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.408 | TFLOPs: 49.13 | 7: iteration 23020/ 44073 | consumed samples: 11786240 | consumed tokens: 24138219520 | elapsed time per iteration (s): 4.16 | learning rate: 1.051E-04 | global batch size: 512 | lm loss: 2.013584E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.118 | TFLOPs: 57.38 | 7: iteration 23030/ 44073 | consumed samples: 11791360 | consumed tokens: 24148705280 | elapsed time per iteration (s): 4.32 | learning rate: 1.050E-04 | global batch size: 512 | lm loss: 2.018017E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.496 | TFLOPs: 55.22 | 7: iteration 23040/ 44073 | consumed samples: 11796480 | consumed tokens: 24159191040 | elapsed time per iteration (s): 4.20 | learning rate: 1.049E-04 | global batch size: 512 | lm loss: 1.993885E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.032 | TFLOPs: 56.87 | 7: iteration 23050/ 44073 | consumed samples: 11801600 | consumed tokens: 24169676800 | elapsed time per iteration (s): 4.20 | learning rate: 1.049E-04 | global batch size: 512 | lm 
loss: 2.009434E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.014 | TFLOPs: 56.86 | 7: iteration 23060/ 44073 | consumed samples: 11806720 | consumed tokens: 24180162560 | elapsed time per iteration (s): 4.14 | learning rate: 1.048E-04 | global batch size: 512 | lm loss: 2.015660E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.635 | TFLOPs: 57.62 | 7: iteration 23070/ 44073 | consumed samples: 11811840 | consumed tokens: 24190648320 | elapsed time per iteration (s): 4.14 | learning rate: 1.047E-04 | global batch size: 512 | lm loss: 2.019528E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.678 | TFLOPs: 57.64 | 7: iteration 23080/ 44073 | consumed samples: 11816960 | consumed tokens: 24201134080 | elapsed time per iteration (s): 4.17 | learning rate: 1.047E-04 | global batch size: 512 | lm loss: 2.006654E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.784 | TFLOPs: 57.22 | 7: iteration 23090/ 44073 | consumed samples: 11822080 | consumed tokens: 24211619840 | elapsed time per iteration (s): 4.14 | learning rate: 1.046E-04 | global batch size: 512 | lm loss: 1.997778E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.790 | TFLOPs: 57.69 | 7: iteration 23100/ 44073 | consumed samples: 11827200 | consumed tokens: 24222105600 | elapsed time per iteration (s): 4.15 | learning rate: 1.045E-04 | global batch size: 512 | lm loss: 2.004415E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.474 | TFLOPs: 57.55 | 7: iteration 23110/ 44073 | consumed samples: 11832320 | consumed tokens: 24232591360 | 
elapsed time per iteration (s): 4.20 | learning rate: 1.045E-04 | global batch size: 512 | lm loss: 2.026559E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.953 | TFLOPs: 56.84 | 7: iteration 23120/ 44073 | consumed samples: 11837440 | consumed tokens: 24243077120 | elapsed time per iteration (s): 4.16 | learning rate: 1.044E-04 | global batch size: 512 | lm loss: 2.011818E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.006 | TFLOPs: 57.33 | 7: iteration 23130/ 44073 | consumed samples: 11842560 | consumed tokens: 24253562880 | elapsed time per iteration (s): 4.19 | learning rate: 1.043E-04 | global batch size: 512 | lm loss: 2.000621E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.184 | TFLOPs: 56.94 | 7: iteration 23140/ 44073 | consumed samples: 11847680 | consumed tokens: 24264048640 | elapsed time per iteration (s): 4.19 | learning rate: 1.043E-04 | global batch size: 512 | lm loss: 1.994909E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.240 | TFLOPs: 56.97 | 7: iteration 23150/ 44073 | consumed samples: 11852800 | consumed tokens: 24274534400 | elapsed time per iteration (s): 4.14 | learning rate: 1.042E-04 | global batch size: 512 | lm loss: 1.989852E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.793 | TFLOPs: 57.69 | 7: iteration 23160/ 44073 | consumed samples: 11857920 | consumed tokens: 24285020160 | elapsed time per iteration (s): 4.16 | learning rate: 1.042E-04 | global batch size: 512 | lm loss: 1.992836E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.053 | TFLOPs: 
57.35 | 7: iteration 23170/ 44073 | consumed samples: 11863040 | consumed tokens: 24295505920 | elapsed time per iteration (s): 4.16 | learning rate: 1.041E-04 | global batch size: 512 | lm loss: 2.008173E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.205 | TFLOPs: 57.42 | 7: iteration 23180/ 44073 | consumed samples: 11868160 | consumed tokens: 24305991680 | elapsed time per iteration (s): 4.13 | learning rate: 1.040E-04 | global batch size: 512 | lm loss: 2.018360E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.882 | TFLOPs: 57.74 | 7: iteration 23190/ 44073 | consumed samples: 11873280 | consumed tokens: 24316477440 | elapsed time per iteration (s): 4.13 | learning rate: 1.040E-04 | global batch size: 512 | lm loss: 2.001752E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.858 | TFLOPs: 57.72 | 7: iteration 23200/ 44073 | consumed samples: 11878400 | consumed tokens: 24326963200 | elapsed time per iteration (s): 4.13 | learning rate: 1.039E-04 | global batch size: 512 | lm loss: 2.018189E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.850 | TFLOPs: 57.72 | 7: iteration 23210/ 44073 | consumed samples: 11883520 | consumed tokens: 24337448960 | elapsed time per iteration (s): 4.13 | learning rate: 1.038E-04 | global batch size: 512 | lm loss: 2.010066E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.894 | TFLOPs: 57.74 | 7: iteration 23220/ 44073 | consumed samples: 11888640 | consumed tokens: 24347934720 | elapsed time per iteration (s): 4.13 | learning rate: 1.038E-04 | global batch size: 512 | lm loss: 2.025330E+00 | grad norm: 0.126 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.857 | TFLOPs: 57.72 | 7: iteration 23230/ 44073 | consumed samples: 11893760 | consumed tokens: 24358420480 | elapsed time per iteration (s): 4.14 | learning rate: 1.037E-04 | global batch size: 512 | lm loss: 2.018102E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.727 | TFLOPs: 57.66 | 7: iteration 23240/ 44073 | consumed samples: 11898880 | consumed tokens: 24368906240 | elapsed time per iteration (s): 4.14 | learning rate: 1.036E-04 | global batch size: 512 | lm loss: 2.034091E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.607 | TFLOPs: 57.61 | 7: iteration 23250/ 44073 | consumed samples: 11904000 | consumed tokens: 24379392000 | elapsed time per iteration (s): 4.19 | learning rate: 1.036E-04 | global batch size: 512 | lm loss: 2.020440E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.141 | TFLOPs: 56.92 | 7: iteration 23260/ 44073 | consumed samples: 11909120 | consumed tokens: 24389877760 | elapsed time per iteration (s): 4.15 | learning rate: 1.035E-04 | global batch size: 512 | lm loss: 2.021387E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.385 | TFLOPs: 57.50 | 7: iteration 23270/ 44073 | consumed samples: 11914240 | consumed tokens: 24400363520 | elapsed time per iteration (s): 4.14 | learning rate: 1.034E-04 | global batch size: 512 | lm loss: 2.028321E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.765 | TFLOPs: 57.68 | 7: iteration 23280/ 44073 | consumed samples: 11919360 | consumed tokens: 24410849280 | elapsed time per iteration (s): 4.15 | learning rate: 1.034E-04 
| global batch size: 512 | lm loss: 2.006017E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.320 | TFLOPs: 57.47 | 7: iteration 23290/ 44073 | consumed samples: 11924480 | consumed tokens: 24421335040 | elapsed time per iteration (s): 4.16 | learning rate: 1.033E-04 | global batch size: 512 | lm loss: 2.019139E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.117 | TFLOPs: 57.38 | 7: iteration 23300/ 44073 | consumed samples: 11929600 | consumed tokens: 24431820800 | elapsed time per iteration (s): 4.13 | learning rate: 1.032E-04 | global batch size: 512 | lm loss: 2.009533E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.871 | TFLOPs: 57.73 | 7: iteration 23310/ 44073 | consumed samples: 11934720 | consumed tokens: 24442306560 | elapsed time per iteration (s): 4.15 | learning rate: 1.032E-04 | global batch size: 512 | lm loss: 2.005422E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.348 | TFLOPs: 57.49 | 7: iteration 23320/ 44073 | consumed samples: 11939840 | consumed tokens: 24452792320 | elapsed time per iteration (s): 4.14 | learning rate: 1.031E-04 | global batch size: 512 | lm loss: 2.019321E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.696 | TFLOPs: 57.65 | 7: iteration 23330/ 44073 | consumed samples: 11944960 | consumed tokens: 24463278080 | elapsed time per iteration (s): 4.13 | learning rate: 1.031E-04 | global batch size: 512 | lm loss: 1.994873E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.823 | TFLOPs: 57.71 | 7: iteration 23340/ 44073 | consumed samples: 11950080 | 
consumed tokens: 24473763840 | elapsed time per iteration (s): 4.15 | learning rate: 1.030E-04 | global batch size: 512 | lm loss: 1.991412E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.498 | TFLOPs: 57.56 | 7: iteration 23350/ 44073 | consumed samples: 11955200 | consumed tokens: 24484249600 | elapsed time per iteration (s): 4.14 | learning rate: 1.029E-04 | global batch size: 512 | lm loss: 2.011268E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.657 | TFLOPs: 57.63 | 7: iteration 23360/ 44073 | consumed samples: 11960320 | consumed tokens: 24494735360 | elapsed time per iteration (s): 4.15 | learning rate: 1.029E-04 | global batch size: 512 | lm loss: 1.987199E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.363 | TFLOPs: 57.49 | 7: iteration 23370/ 44073 | consumed samples: 11965440 | consumed tokens: 24505221120 | elapsed time per iteration (s): 4.14 | learning rate: 1.028E-04 | global batch size: 512 | lm loss: 2.010523E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.544 | TFLOPs: 57.58 | 7: iteration 23380/ 44073 | consumed samples: 11970560 | consumed tokens: 24515706880 | elapsed time per iteration (s): 4.14 | learning rate: 1.027E-04 | global batch size: 512 | lm loss: 1.990898E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.606 | TFLOPs: 57.61 | 7: iteration 23390/ 44073 | consumed samples: 11975680 | consumed tokens: 24526192640 | elapsed time per iteration (s): 4.16 | learning rate: 1.027E-04 | global batch size: 512 | lm loss: 2.016035E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 123.086 | TFLOPs: 57.36 | 7: iteration 23400/ 44073 | consumed samples: 11980800 | consumed tokens: 24536678400 | elapsed time per iteration (s): 4.15 | learning rate: 1.026E-04 | global batch size: 512 | lm loss: 2.004624E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.226 | TFLOPs: 57.43 | 7: iteration 23410/ 44073 | consumed samples: 11985920 | consumed tokens: 24547164160 | elapsed time per iteration (s): 4.17 | learning rate: 1.025E-04 | global batch size: 512 | lm loss: 2.020130E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.882 | TFLOPs: 57.27 | 7: iteration 23420/ 44073 | consumed samples: 11991040 | consumed tokens: 24557649920 | elapsed time per iteration (s): 4.15 | learning rate: 1.025E-04 | global batch size: 512 | lm loss: 2.009974E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.501 | TFLOPs: 57.56 | 7: iteration 23430/ 44073 | consumed samples: 11996160 | consumed tokens: 24568135680 | elapsed time per iteration (s): 4.17 | learning rate: 1.024E-04 | global batch size: 512 | lm loss: 2.009783E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.653 | TFLOPs: 57.16 | 7: iteration 23440/ 44073 | consumed samples: 12001280 | consumed tokens: 24578621440 | elapsed time per iteration (s): 4.16 | learning rate: 1.023E-04 | global batch size: 512 | lm loss: 2.018095E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.218 | TFLOPs: 57.43 | 7: iteration 23450/ 44073 | consumed samples: 12006400 | consumed tokens: 24589107200 | elapsed time per iteration (s): 4.14 | learning rate: 1.023E-04 | global batch size: 512 | lm loss: 2.002659E+00 | grad norm: 
0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.757 | TFLOPs: 57.68 | 7: iteration 23460/ 44073 | consumed samples: 12011520 | consumed tokens: 24599592960 | elapsed time per iteration (s): 4.17 | learning rate: 1.022E-04 | global batch size: 512 | lm loss: 2.035155E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.851 | TFLOPs: 57.25 | 7: iteration 23470/ 44073 | consumed samples: 12016640 | consumed tokens: 24610078720 | elapsed time per iteration (s): 4.13 | learning rate: 1.022E-04 | global batch size: 512 | lm loss: 2.013151E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.881 | TFLOPs: 57.73 | 7: iteration 23480/ 44073 | consumed samples: 12021760 | consumed tokens: 24620564480 | elapsed time per iteration (s): 4.13 | learning rate: 1.021E-04 | global batch size: 512 | lm loss: 2.000643E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.877 | TFLOPs: 57.73 | 7: iteration 23490/ 44073 | consumed samples: 12026880 | consumed tokens: 24631050240 | elapsed time per iteration (s): 4.13 | learning rate: 1.020E-04 | global batch size: 512 | lm loss: 2.029712E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.868 | TFLOPs: 57.73 | 7: iteration 23500/ 44073 | consumed samples: 12032000 | consumed tokens: 24641536000 | elapsed time per iteration (s): 4.18 | learning rate: 1.020E-04 | global batch size: 512 | lm loss: 2.016386E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.394 | TFLOPs: 57.04 | 7: iteration 23510/ 44073 | consumed samples: 12037120 | consumed tokens: 24652021760 | elapsed time per iteration (s): 
4.19 | learning rate: 1.019E-04 | global batch size: 512 | lm loss: 2.009042E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.145 | TFLOPs: 56.93 | 7: iteration 23520/ 44073 | consumed samples: 12042240 | consumed tokens: 24662507520 | elapsed time per iteration (s): 4.14 | learning rate: 1.018E-04 | global batch size: 512 | lm loss: 2.008849E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.629 | TFLOPs: 57.62 | 7: iteration 23530/ 44073 | consumed samples: 12047360 | consumed tokens: 24672993280 | elapsed time per iteration (s): 4.15 | learning rate: 1.018E-04 | global batch size: 512 | lm loss: 2.013655E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.244 | TFLOPs: 57.44 | 7: iteration 23540/ 44073 | consumed samples: 12052480 | consumed tokens: 24683479040 | elapsed time per iteration (s): 4.15 | learning rate: 1.017E-04 | global batch size: 512 | lm loss: 2.007258E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.342 | TFLOPs: 57.48 | 7: iteration 23550/ 44073 | consumed samples: 12057600 | consumed tokens: 24693964800 | elapsed time per iteration (s): 4.19 | learning rate: 1.016E-04 | global batch size: 512 | lm loss: 1.989201E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.119 | TFLOPs: 56.91 | 7: iteration 23560/ 44073 | consumed samples: 12062720 | consumed tokens: 24704450560 | elapsed time per iteration (s): 4.19 | learning rate: 1.016E-04 | global batch size: 512 | lm loss: 2.011752E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.304 | TFLOPs: 57.00 | 7: iteration 23570/ 44073 
| consumed samples: 12067840 | consumed tokens: 24714936320 | elapsed time per iteration (s): 4.16 | learning rate: 1.015E-04 | global batch size: 512 | lm loss: 1.999229E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.006 | TFLOPs: 57.33 | 7: iteration 23580/ 44073 | consumed samples: 12072960 | consumed tokens: 24725422080 | elapsed time per iteration (s): 4.17 | learning rate: 1.014E-04 | global batch size: 512 | lm loss: 2.011320E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.714 | TFLOPs: 57.19 | 7: iteration 23590/ 44073 | consumed samples: 12078080 | consumed tokens: 24735907840 | elapsed time per iteration (s): 4.20 | learning rate: 1.014E-04 | global batch size: 512 | lm loss: 2.017185E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.938 | TFLOPs: 56.83 | 7: iteration 23600/ 44073 | consumed samples: 12083200 | consumed tokens: 24746393600 | elapsed time per iteration (s): 4.17 | learning rate: 1.013E-04 | global batch size: 512 | lm loss: 2.011355E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.807 | TFLOPs: 57.23 | 7: iteration 23610/ 44073 | consumed samples: 12088320 | consumed tokens: 24756879360 | elapsed time per iteration (s): 4.35 | learning rate: 1.012E-04 | global batch size: 512 | lm loss: 2.015609E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.573 | TFLOPs: 54.79 | 7: iteration 23620/ 44073 | consumed samples: 12093440 | consumed tokens: 24767365120 | elapsed time per iteration (s): 4.29 | learning rate: 1.012E-04 | global batch size: 512 | lm loss: 1.982343E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 119.218 | TFLOPs: 55.56 | 7: iteration 23630/ 44073 | consumed samples: 12098560 | consumed tokens: 24777850880 | elapsed time per iteration (s): 4.18 | learning rate: 1.011E-04 | global batch size: 512 | lm loss: 2.005728E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.464 | TFLOPs: 57.07 | 7: iteration 23640/ 44073 | consumed samples: 12103680 | consumed tokens: 24788336640 | elapsed time per iteration (s): 4.39 | learning rate: 1.011E-04 | global batch size: 512 | lm loss: 1.991619E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.704 | TFLOPs: 54.39 | 7: iteration 23650/ 44073 | consumed samples: 12108800 | consumed tokens: 24798822400 | elapsed time per iteration (s): 4.19 | learning rate: 1.010E-04 | global batch size: 512 | lm loss: 1.990356E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.218 | TFLOPs: 56.96 | 7: iteration 23660/ 44073 | consumed samples: 12113920 | consumed tokens: 24809308160 | elapsed time per iteration (s): 4.19 | learning rate: 1.009E-04 | global batch size: 512 | lm loss: 2.017447E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.258 | TFLOPs: 56.98 | 7: iteration 23670/ 44073 | consumed samples: 12119040 | consumed tokens: 24819793920 | elapsed time per iteration (s): 4.17 | learning rate: 1.009E-04 | global batch size: 512 | lm loss: 2.017313E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.753 | TFLOPs: 57.21 | 7: iteration 23680/ 44073 | consumed samples: 12124160 | consumed tokens: 24830279680 | elapsed time per iteration (s): 4.15 | learning rate: 1.008E-04 | global batch size: 512 | lm 
loss: 2.012855E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 | 7: iteration 23690/ 44073 | consumed samples: 12129280 | consumed tokens: 24840765440 | elapsed time per iteration (s): 4.32 | learning rate: 1.007E-04 | global batch size: 512 | lm loss: 1.985268E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.391 | TFLOPs: 55.18 | 7: iteration 23700/ 44073 | consumed samples: 12134400 | consumed tokens: 24851251200 | elapsed time per iteration (s): 4.19 | learning rate: 1.007E-04 | global batch size: 512 | lm loss: 2.009125E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.116 | TFLOPs: 56.91 | 7: iteration 23710/ 44073 | consumed samples: 12139520 | consumed tokens: 24861736960 | elapsed time per iteration (s): 4.15 | learning rate: 1.006E-04 | global batch size: 512 | lm loss: 2.006498E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.388 | TFLOPs: 57.50 | 7: iteration 23720/ 44073 | consumed samples: 12144640 | consumed tokens: 24872222720 | elapsed time per iteration (s): 4.19 | learning rate: 1.005E-04 | global batch size: 512 | lm loss: 1.998778E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.130 | TFLOPs: 56.92 | 7: iteration 23730/ 44073 | consumed samples: 12149760 | consumed tokens: 24882708480 | elapsed time per iteration (s): 4.15 | learning rate: 1.005E-04 | global batch size: 512 | lm loss: 2.001639E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 23740/ 44073 | consumed samples: 12154880 | consumed tokens: 24893194240 | 
elapsed time per iteration (s): 4.16 | learning rate: 1.004E-04 | global batch size: 512 | lm loss: 2.019091E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.221 | TFLOPs: 57.43 | 7: iteration 23750/ 44073 | consumed samples: 12160000 | consumed tokens: 24903680000 | elapsed time per iteration (s): 4.20 | learning rate: 1.003E-04 | global batch size: 512 | lm loss: 2.001110E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.970 | TFLOPs: 56.84 | 7: iteration 23760/ 44073 | consumed samples: 12165120 | consumed tokens: 24914165760 | elapsed time per iteration (s): 4.18 | learning rate: 1.003E-04 | global batch size: 512 | lm loss: 1.995874E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.539 | TFLOPs: 57.11 | 7: iteration 23770/ 44073 | consumed samples: 12170240 | consumed tokens: 24924651520 | elapsed time per iteration (s): 4.15 | learning rate: 1.002E-04 | global batch size: 512 | lm loss: 2.010505E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.408 | TFLOPs: 57.51 | 7: iteration 23780/ 44073 | consumed samples: 12175360 | consumed tokens: 24935137280 | elapsed time per iteration (s): 4.19 | learning rate: 1.002E-04 | global batch size: 512 | lm loss: 2.015040E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.123 | TFLOPs: 56.92 | 7: iteration 23790/ 44073 | consumed samples: 12180480 | consumed tokens: 24945623040 | elapsed time per iteration (s): 4.15 | learning rate: 1.001E-04 | global batch size: 512 | lm loss: 1.996992E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.328 | TFLOPs: 
57.48 | 7: iteration 23800/ 44073 | consumed samples: 12185600 | consumed tokens: 24956108800 | elapsed time per iteration (s): 4.17 | learning rate: 1.000E-04 | global batch size: 512 | lm loss: 2.007903E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.900 | TFLOPs: 57.28 | 7: iteration 23810/ 44073 | consumed samples: 12190720 | consumed tokens: 24966594560 | elapsed time per iteration (s): 4.15 | learning rate: 9.996E-05 | global batch size: 512 | lm loss: 1.997001E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.431 | TFLOPs: 57.53 | 7: iteration 23820/ 44073 | consumed samples: 12195840 | consumed tokens: 24977080320 | elapsed time per iteration (s): 4.16 | learning rate: 9.989E-05 | global batch size: 512 | lm loss: 1.998065E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.941 | TFLOPs: 57.30 | 7: iteration 23830/ 44073 | consumed samples: 12200960 | consumed tokens: 24987566080 | elapsed time per iteration (s): 4.19 | learning rate: 9.983E-05 | global batch size: 512 | lm loss: 2.002921E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.238 | TFLOPs: 56.97 | 7: iteration 23840/ 44073 | consumed samples: 12206080 | consumed tokens: 24998051840 | elapsed time per iteration (s): 4.22 | learning rate: 9.977E-05 | global batch size: 512 | lm loss: 2.018434E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.277 | TFLOPs: 56.52 | 7: iteration 23850/ 44073 | consumed samples: 12211200 | consumed tokens: 25008537600 | elapsed time per iteration (s): 4.30 | learning rate: 9.970E-05 | global batch size: 512 | lm loss: 2.002556E+00 | grad norm: 0.126 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.957 | TFLOPs: 55.44 | 7: iteration 23860/ 44073 | consumed samples: 12216320 | consumed tokens: 25019023360 | elapsed time per iteration (s): 4.19 | learning rate: 9.964E-05 | global batch size: 512 | lm loss: 2.004439E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.256 | TFLOPs: 56.98 | 7: iteration 23870/ 44073 | consumed samples: 12221440 | consumed tokens: 25029509120 | elapsed time per iteration (s): 4.30 | learning rate: 9.957E-05 | global batch size: 512 | lm loss: 1.977829E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.959 | TFLOPs: 55.44 | 7: iteration 23880/ 44073 | consumed samples: 12226560 | consumed tokens: 25039994880 | elapsed time per iteration (s): 4.25 | learning rate: 9.951E-05 | global batch size: 512 | lm loss: 1.982235E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.579 | TFLOPs: 56.20 | 7: iteration 23890/ 44073 | consumed samples: 12231680 | consumed tokens: 25050480640 | elapsed time per iteration (s): 4.19 | learning rate: 9.944E-05 | global batch size: 512 | lm loss: 2.007823E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.188 | TFLOPs: 56.95 | 7: iteration 23900/ 44073 | consumed samples: 12236800 | consumed tokens: 25060966400 | elapsed time per iteration (s): 4.18 | learning rate: 9.938E-05 | global batch size: 512 | lm loss: 2.014584E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.434 | TFLOPs: 57.06 | 7: iteration 23910/ 44073 | consumed samples: 12241920 | consumed tokens: 25071452160 | elapsed time per iteration (s): 4.19 | learning rate: 9.931E-05 
| global batch size: 512 | lm loss: 1.987422E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.152 | TFLOPs: 56.93 | 7: iteration 23920/ 44073 | consumed samples: 12247040 | consumed tokens: 25081937920 | elapsed time per iteration (s): 4.18 | learning rate: 9.925E-05 | global batch size: 512 | lm loss: 2.011849E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.522 | TFLOPs: 57.10 | 7: iteration 23930/ 44073 | consumed samples: 12252160 | consumed tokens: 25092423680 | elapsed time per iteration (s): 4.21 | learning rate: 9.919E-05 | global batch size: 512 | lm loss: 1.997518E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.586 | TFLOPs: 56.67 | 7: iteration 23940/ 44073 | consumed samples: 12257280 | consumed tokens: 25102909440 | elapsed time per iteration (s): 4.17 | learning rate: 9.912E-05 | global batch size: 512 | lm loss: 2.012357E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.695 | TFLOPs: 57.18 | 7: iteration 23950/ 44073 | consumed samples: 12262400 | consumed tokens: 25113395200 | elapsed time per iteration (s): 4.16 | learning rate: 9.906E-05 | global batch size: 512 | lm loss: 2.000481E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.952 | TFLOPs: 57.30 | 7: iteration 23960/ 44073 | consumed samples: 12267520 | consumed tokens: 25123880960 | elapsed time per iteration (s): 4.15 | learning rate: 9.899E-05 | global batch size: 512 | lm loss: 2.008941E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.243 | TFLOPs: 57.44 | 7: iteration 23970/ 44073 | consumed samples: 12272640 | 
consumed tokens: 25134366720 | elapsed time per iteration (s): 4.31 | learning rate: 9.893E-05 | global batch size: 512 | lm loss: 2.009643E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.902 | TFLOPs: 55.41 | 7: iteration 23980/ 44073 | consumed samples: 12277760 | consumed tokens: 25144852480 | elapsed time per iteration (s): 4.15 | learning rate: 9.886E-05 | global batch size: 512 | lm loss: 1.999999E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.482 | TFLOPs: 57.55 | 7: iteration 23990/ 44073 | consumed samples: 12282880 | consumed tokens: 25155338240 | elapsed time per iteration (s): 4.16 | learning rate: 9.880E-05 | global batch size: 512 | lm loss: 2.016288E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.197 | TFLOPs: 57.42 | 0: [2022-11-26 14:35:36,470] [INFO] [logging.py:68:log_dist] [Rank 0] step=24000, skipped=0, lr=[9.873603684605252e-05, 9.873603684605252e-05, 9.873603684605252e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 24000/ 44073 | consumed samples: 12288000 | consumed tokens: 25165824000 | elapsed time per iteration (s): 4.15 | learning rate: 9.874E-05 | global batch size: 512 | lm loss: 2.003626E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.300 | TFLOPs: 57.46 | 0: steps: 24000 loss: 2.0186 iter time (s): 4.166 samples/sec: 122.885 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 24000 | lm loss value: 1.944737E+00 | lm loss PPL: 6.991794E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 24000 to checkpoints_2b2 0: [2022-11-26 14:35:37,825] 
[INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step24000 is begin to save! 0: [2022-11-26 14:35:37,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:35:38,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:35:38,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:35:38,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:35:38,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:35:38,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:35:38,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:35:38,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:35:38,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:35:38,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:35:38,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:35:38,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 14:35:38,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:35:39,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:35:39,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:35:39,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:35:39,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:35:39,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:35:39,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:35:39,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:35:39,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:35:39,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:35:39,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 14:35:39,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:35:39,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:35:39,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:35:40,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:35:40,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:35:40,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:35:40,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:35:40,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:35:40,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:35:40,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:35:40,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:35:40,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 14:35:40,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:35:40,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:35:40,720] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:35:40,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:35:40,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:35:40,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:35:40,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:35:41,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:35:41,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:35:41,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:35:41,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:35:41,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 14:35:41,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:35:41,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:35:41,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:35:41,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:35:41,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:35:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:35:41,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_29-model_00-model_states.pt... 0: [2022-11-26 14:35:41,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_29-model_00-model_states.pt. 0: [2022-11-26 14:35:41,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:35:42,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:35:42,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_31-model_00-model_states.pt... 0: [2022-11-26 14:35:42,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_31-model_00-model_states.pt. 
0: [2022-11-26 14:35:42,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_32-model_00-model_states.pt... 0: [2022-11-26 14:35:42,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_32-model_00-model_states.pt. 0: [2022-11-26 14:35:42,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_33-model_00-model_states.pt... 0: [2022-11-26 14:35:42,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_33-model_00-model_states.pt. 0: [2022-11-26 14:35:42,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_34-model_00-model_states.pt... 0: [2022-11-26 14:35:42,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_34-model_00-model_states.pt. 0: [2022-11-26 14:35:42,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/layer_36-model_00-model_states.pt... 0: [2022-11-26 14:35:42,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/layer_36-model_00-model_states.pt. 0: [2022-11-26 14:35:42,628] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step24000/mp_rank_00_model_states.pt 0: [2022-11-26 14:35:42,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:35:42,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:42,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:35:43,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
0: [2022-11-26 14:35:43,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:35:43,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:35:43,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 14:35:43,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:35:43,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:35:43,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:35:43,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:35:43,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:35:43,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 14:35:43,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:35:43,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:35:43,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:35:43,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:35:43,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 14:35:43,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:35:43,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:35:43,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:35:43,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:35:43,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:35:43,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 14:35:43,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:35:43,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:35:43,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:35:43,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 14:35:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:35:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:35:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
0: [2022-11-26 14:35:43,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:35:43,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:35:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:35:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 14:35:43,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:35:43,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:35:43,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:35:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:35:43,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 14:35:43,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
0: successfully saved checkpoint at iteration 24000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6046.80 7: iteration 24010/ 44073 | consumed samples: 12293120 | consumed tokens: 25176309760 | elapsed time per iteration (s): 4.91 | learning rate: 9.867E-05 | global batch size: 512 | lm loss: 2.025010E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.289 | TFLOPs: 48.60 | 7: iteration 24020/ 44073 | consumed samples: 12298240 | consumed tokens: 25186795520 | elapsed time per iteration (s): 4.15 | learning rate: 9.861E-05 | global batch size: 512 | lm loss: 2.004685E+00 | grad norm: 0.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.436 | TFLOPs: 57.53 | 7: iteration 24030/ 44073 | consumed samples: 12303360 | consumed tokens: 25197281280 | elapsed time per iteration (s): 4.15 | learning rate: 9.854E-05 | global batch size: 512 | lm loss: 2.010905E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.376 | TFLOPs: 57.50 | 7: iteration 24040/ 44073 | consumed samples: 12308480 | consumed tokens: 25207767040 | elapsed time per iteration (s): 4.19 | learning rate: 9.848E-05 | global batch size: 512 | lm loss: 2.002752E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.171 | TFLOPs: 56.94 | 7: iteration 24050/ 44073 | consumed samples: 12313600 | consumed tokens: 25218252800 | elapsed time per iteration (s): 4.15 | learning rate: 9.841E-05 | global batch size: 512 | lm loss: 1.993088E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.403 | TFLOPs: 57.51 | 7: iteration 24060/ 44073 | consumed samples: 12318720 | consumed tokens: 25228738560 | elapsed time per iteration (s): 4.18 | learning rate: 
9.835E-05 | global batch size: 512 | lm loss: 1.988015E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.581 | TFLOPs: 57.13 | 7: iteration 24070/ 44073 | consumed samples: 12323840 | consumed tokens: 25239224320 | elapsed time per iteration (s): 4.17 | learning rate: 9.829E-05 | global batch size: 512 | lm loss: 2.001099E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.829 | TFLOPs: 57.24 | 7: iteration 24080/ 44073 | consumed samples: 12328960 | consumed tokens: 25249710080 | elapsed time per iteration (s): 4.17 | learning rate: 9.822E-05 | global batch size: 512 | lm loss: 2.001882E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.673 | TFLOPs: 57.17 | 7: iteration 24090/ 44073 | consumed samples: 12334080 | consumed tokens: 25260195840 | elapsed time per iteration (s): 4.15 | learning rate: 9.816E-05 | global batch size: 512 | lm loss: 2.006926E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 24100/ 44073 | consumed samples: 12339200 | consumed tokens: 25270681600 | elapsed time per iteration (s): 4.14 | learning rate: 9.809E-05 | global batch size: 512 | lm loss: 1.997083E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.803 | TFLOPs: 57.70 | 7: iteration 24110/ 44073 | consumed samples: 12344320 | consumed tokens: 25281167360 | elapsed time per iteration (s): 4.18 | learning rate: 9.803E-05 | global batch size: 512 | lm loss: 1.996026E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.467 | TFLOPs: 57.08 | 7: iteration 24120/ 44073 | consumed samples: 
12349440 | consumed tokens: 25291653120 | elapsed time per iteration (s): 4.30 | learning rate: 9.796E-05 | global batch size: 512 | lm loss: 1.988218E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.950 | TFLOPs: 55.44 | 7: iteration 24130/ 44073 | consumed samples: 12354560 | consumed tokens: 25302138880 | elapsed time per iteration (s): 4.14 | learning rate: 9.790E-05 | global batch size: 512 | lm loss: 1.984710E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.636 | TFLOPs: 57.62 | 7: iteration 24140/ 44073 | consumed samples: 12359680 | consumed tokens: 25312624640 | elapsed time per iteration (s): 4.14 | learning rate: 9.784E-05 | global batch size: 512 | lm loss: 2.018603E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.658 | TFLOPs: 57.63 | 7: iteration 24150/ 44073 | consumed samples: 12364800 | consumed tokens: 25323110400 | elapsed time per iteration (s): 4.15 | learning rate: 9.777E-05 | global batch size: 512 | lm loss: 2.015043E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.355 | TFLOPs: 57.49 | 7: iteration 24160/ 44073 | consumed samples: 12369920 | consumed tokens: 25333596160 | elapsed time per iteration (s): 4.13 | learning rate: 9.771E-05 | global batch size: 512 | lm loss: 2.014156E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.934 | TFLOPs: 57.76 | 7: iteration 24170/ 44073 | consumed samples: 12375040 | consumed tokens: 25344081920 | elapsed time per iteration (s): 4.15 | learning rate: 9.764E-05 | global batch size: 512 | lm loss: 1.988978E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.263 | TFLOPs: 57.45 | 7: iteration 24180/ 44073 | consumed samples: 12380160 | consumed tokens: 25354567680 | elapsed time per iteration (s): 4.16 | learning rate: 9.758E-05 | global batch size: 512 | lm loss: 1.998583E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.953 | TFLOPs: 57.30 | 7: iteration 24190/ 44073 | consumed samples: 12385280 | consumed tokens: 25365053440 | elapsed time per iteration (s): 4.14 | learning rate: 9.752E-05 | global batch size: 512 | lm loss: 2.012187E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.576 | TFLOPs: 57.59 | 7: iteration 24200/ 44073 | consumed samples: 12390400 | consumed tokens: 25375539200 | elapsed time per iteration (s): 4.18 | learning rate: 9.745E-05 | global batch size: 512 | lm loss: 1.992801E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.578 | TFLOPs: 57.13 | 7: iteration 24210/ 44073 | consumed samples: 12395520 | consumed tokens: 25386024960 | elapsed time per iteration (s): 4.18 | learning rate: 9.739E-05 | global batch size: 512 | lm loss: 1.996671E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.388 | TFLOPs: 57.04 | 7: iteration 24220/ 44073 | consumed samples: 12400640 | consumed tokens: 25396510720 | elapsed time per iteration (s): 4.18 | learning rate: 9.732E-05 | global batch size: 512 | lm loss: 2.003308E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.538 | TFLOPs: 57.11 | 7: iteration 24230/ 44073 | consumed samples: 12405760 | consumed tokens: 25406996480 | elapsed time per iteration (s): 4.18 | learning rate: 9.726E-05 | global batch size: 512 | lm loss: 1.999235E+00 | 
grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.595 | TFLOPs: 57.14 | 7: iteration 24240/ 44073 | consumed samples: 12410880 | consumed tokens: 25417482240 | elapsed time per iteration (s): 4.21 | learning rate: 9.719E-05 | global batch size: 512 | lm loss: 1.986050E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.748 | TFLOPs: 56.74 | 7: iteration 24250/ 44073 | consumed samples: 12416000 | consumed tokens: 25427968000 | elapsed time per iteration (s): 4.21 | learning rate: 9.713E-05 | global batch size: 512 | lm loss: 2.016874E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.629 | TFLOPs: 56.69 | 7: iteration 24260/ 44073 | consumed samples: 12421120 | consumed tokens: 25438453760 | elapsed time per iteration (s): 4.22 | learning rate: 9.707E-05 | global batch size: 512 | lm loss: 2.000718E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.347 | TFLOPs: 56.55 | 7: iteration 24270/ 44073 | consumed samples: 12426240 | consumed tokens: 25448939520 | elapsed time per iteration (s): 4.20 | learning rate: 9.700E-05 | global batch size: 512 | lm loss: 1.997202E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.765 | TFLOPs: 56.75 | 7: iteration 24280/ 44073 | consumed samples: 12431360 | consumed tokens: 25459425280 | elapsed time per iteration (s): 4.16 | learning rate: 9.694E-05 | global batch size: 512 | lm loss: 2.012175E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.061 | TFLOPs: 57.35 | 7: iteration 24290/ 44073 | consumed samples: 12436480 | consumed tokens: 25469911040 | elapsed time per 
iteration (s): 4.17 | learning rate: 9.687E-05 | global batch size: 512 | lm loss: 1.999293E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.852 | TFLOPs: 57.26 | 7: iteration 24300/ 44073 | consumed samples: 12441600 | consumed tokens: 25480396800 | elapsed time per iteration (s): 4.19 | learning rate: 9.681E-05 | global batch size: 512 | lm loss: 1.997095E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.207 | TFLOPs: 56.95 | 7: iteration 24310/ 44073 | consumed samples: 12446720 | consumed tokens: 25490882560 | elapsed time per iteration (s): 4.21 | learning rate: 9.675E-05 | global batch size: 512 | lm loss: 2.006426E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.518 | TFLOPs: 56.63 | 7: iteration 24320/ 44073 | consumed samples: 12451840 | consumed tokens: 25501368320 | elapsed time per iteration (s): 4.41 | learning rate: 9.668E-05 | global batch size: 512 | lm loss: 2.005782E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.162 | TFLOPs: 54.14 | 7: iteration 24330/ 44073 | consumed samples: 12456960 | consumed tokens: 25511854080 | elapsed time per iteration (s): 4.20 | learning rate: 9.662E-05 | global batch size: 512 | lm loss: 2.011562E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.037 | TFLOPs: 56.88 | 7: iteration 24340/ 44073 | consumed samples: 12462080 | consumed tokens: 25522339840 | elapsed time per iteration (s): 4.19 | learning rate: 9.655E-05 | global batch size: 512 | lm loss: 1.996576E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.132 | TFLOPs: 56.92 | 7: 
iteration 24350/ 44073 | consumed samples: 12467200 | consumed tokens: 25532825600 | elapsed time per iteration (s): 4.17 | learning rate: 9.649E-05 | global batch size: 512 | lm loss: 1.994490E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.724 | TFLOPs: 57.20 | 7: iteration 24360/ 44073 | consumed samples: 12472320 | consumed tokens: 25543311360 | elapsed time per iteration (s): 4.16 | learning rate: 9.643E-05 | global batch size: 512 | lm loss: 2.008336E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.016 | TFLOPs: 57.33 | 7: iteration 24370/ 44073 | consumed samples: 12477440 | consumed tokens: 25553797120 | elapsed time per iteration (s): 4.22 | learning rate: 9.636E-05 | global batch size: 512 | lm loss: 1.991537E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.219 | TFLOPs: 56.49 | 7: iteration 24380/ 44073 | consumed samples: 12482560 | consumed tokens: 25564282880 | elapsed time per iteration (s): 4.17 | learning rate: 9.630E-05 | global batch size: 512 | lm loss: 1.997533E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.845 | TFLOPs: 57.25 | 7: iteration 24390/ 44073 | consumed samples: 12487680 | consumed tokens: 25574768640 | elapsed time per iteration (s): 4.13 | learning rate: 9.623E-05 | global batch size: 512 | lm loss: 2.004854E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.855 | TFLOPs: 57.72 | 7: iteration 24400/ 44073 | consumed samples: 12492800 | consumed tokens: 25585254400 | elapsed time per iteration (s): 4.18 | learning rate: 9.617E-05 | global batch size: 512 | lm loss: 2.006329E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.562 | TFLOPs: 57.12 | 7: iteration 24410/ 44073 | consumed samples: 12497920 | consumed tokens: 25595740160 | elapsed time per iteration (s): 4.18 | learning rate: 9.611E-05 | global batch size: 512 | lm loss: 2.004196E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.374 | TFLOPs: 57.03 | 7: iteration 24420/ 44073 | consumed samples: 12503040 | consumed tokens: 25606225920 | elapsed time per iteration (s): 4.19 | learning rate: 9.604E-05 | global batch size: 512 | lm loss: 2.004674E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.107 | TFLOPs: 56.91 | 7: iteration 24430/ 44073 | consumed samples: 12508160 | consumed tokens: 25616711680 | elapsed time per iteration (s): 4.20 | learning rate: 9.598E-05 | global batch size: 512 | lm loss: 1.996478E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.878 | TFLOPs: 56.80 | 7: iteration 24440/ 44073 | consumed samples: 12513280 | consumed tokens: 25627197440 | elapsed time per iteration (s): 4.17 | learning rate: 9.591E-05 | global batch size: 512 | lm loss: 1.988216E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.782 | TFLOPs: 57.22 | 7: iteration 24450/ 44073 | consumed samples: 12518400 | consumed tokens: 25637683200 | elapsed time per iteration (s): 4.19 | learning rate: 9.585E-05 | global batch size: 512 | lm loss: 1.995741E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.128 | TFLOPs: 56.92 | 7: iteration 24460/ 44073 | consumed samples: 12523520 | consumed tokens: 25648168960 | elapsed time per iteration (s): 4.14 | learning rate: 9.579E-05 | global 
batch size: 512 | lm loss: 1.997092E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.607 | TFLOPs: 57.61 | 7: iteration 24470/ 44073 | consumed samples: 12528640 | consumed tokens: 25658654720 | elapsed time per iteration (s): 4.16 | learning rate: 9.572E-05 | global batch size: 512 | lm loss: 1.996112E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.161 | TFLOPs: 57.40 | 7: iteration 24480/ 44073 | consumed samples: 12533760 | consumed tokens: 25669140480 | elapsed time per iteration (s): 4.16 | learning rate: 9.566E-05 | global batch size: 512 | lm loss: 2.003514E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.009 | TFLOPs: 57.33 | 7: iteration 24490/ 44073 | consumed samples: 12538880 | consumed tokens: 25679626240 | elapsed time per iteration (s): 4.19 | learning rate: 9.559E-05 | global batch size: 512 | lm loss: 2.011568E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.312 | TFLOPs: 57.00 | 7: iteration 24500/ 44073 | consumed samples: 12544000 | consumed tokens: 25690112000 | elapsed time per iteration (s): 4.13 | learning rate: 9.553E-05 | global batch size: 512 | lm loss: 1.998925E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.856 | TFLOPs: 57.72 | 7: iteration 24510/ 44073 | consumed samples: 12549120 | consumed tokens: 25700597760 | elapsed time per iteration (s): 4.22 | learning rate: 9.547E-05 | global batch size: 512 | lm loss: 1.996172E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.358 | TFLOPs: 56.56 | 7: iteration 24520/ 44073 | consumed samples: 12554240 | consumed 
tokens: 25711083520 | elapsed time per iteration (s): 4.18 | learning rate: 9.540E-05 | global batch size: 512 | lm loss: 1.996683E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.539 | TFLOPs: 57.11 | 7: iteration 24530/ 44073 | consumed samples: 12559360 | consumed tokens: 25721569280 | elapsed time per iteration (s): 4.20 | learning rate: 9.534E-05 | global batch size: 512 | lm loss: 2.008040E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.887 | TFLOPs: 56.81 | 7: iteration 24540/ 44073 | consumed samples: 12564480 | consumed tokens: 25732055040 | elapsed time per iteration (s): 4.18 | learning rate: 9.527E-05 | global batch size: 512 | lm loss: 1.990444E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.419 | TFLOPs: 57.05 | 7: iteration 24550/ 44073 | consumed samples: 12569600 | consumed tokens: 25742540800 | elapsed time per iteration (s): 4.30 | learning rate: 9.521E-05 | global batch size: 512 | lm loss: 2.020074E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.000 | TFLOPs: 55.46 | 7: iteration 24560/ 44073 | consumed samples: 12574720 | consumed tokens: 25753026560 | elapsed time per iteration (s): 4.20 | learning rate: 9.515E-05 | global batch size: 512 | lm loss: 1.986472E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.961 | TFLOPs: 56.84 | 7: iteration 24570/ 44073 | consumed samples: 12579840 | consumed tokens: 25763512320 | elapsed time per iteration (s): 4.22 | learning rate: 9.508E-05 | global batch size: 512 | lm loss: 1.997031E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 121.299 | TFLOPs: 56.53 | 7: iteration 24580/ 44073 | consumed samples: 12584960 | consumed tokens: 25773998080 | elapsed time per iteration (s): 4.15 | learning rate: 9.502E-05 | global batch size: 512 | lm loss: 1.999773E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.227 | TFLOPs: 57.43 | 7: iteration 24590/ 44073 | consumed samples: 12590080 | consumed tokens: 25784483840 | elapsed time per iteration (s): 4.26 | learning rate: 9.495E-05 | global batch size: 512 | lm loss: 2.002271E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.229 | TFLOPs: 56.03 | 7: iteration 24600/ 44073 | consumed samples: 12595200 | consumed tokens: 25794969600 | elapsed time per iteration (s): 4.15 | learning rate: 9.489E-05 | global batch size: 512 | lm loss: 1.994240E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.475 | TFLOPs: 57.55 | 7: iteration 24610/ 44073 | consumed samples: 12600320 | consumed tokens: 25805455360 | elapsed time per iteration (s): 4.17 | learning rate: 9.483E-05 | global batch size: 512 | lm loss: 1.997509E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.836 | TFLOPs: 57.25 | 7: iteration 24620/ 44073 | consumed samples: 12605440 | consumed tokens: 25815941120 | elapsed time per iteration (s): 4.25 | learning rate: 9.476E-05 | global batch size: 512 | lm loss: 2.007921E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.485 | TFLOPs: 56.15 | 7: iteration 24630/ 44073 | consumed samples: 12610560 | consumed tokens: 25826426880 | elapsed time per iteration (s): 4.14 | learning rate: 9.470E-05 | global batch size: 512 | lm loss: 1.986557E+00 | grad norm: 0.126 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.538 | TFLOPs: 57.57 | 7: iteration 24640/ 44073 | consumed samples: 12615680 | consumed tokens: 25836912640 | elapsed time per iteration (s): 4.16 | learning rate: 9.463E-05 | global batch size: 512 | lm loss: 1.988976E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.947 | TFLOPs: 57.30 | 7: iteration 24650/ 44073 | consumed samples: 12620800 | consumed tokens: 25847398400 | elapsed time per iteration (s): 4.15 | learning rate: 9.457E-05 | global batch size: 512 | lm loss: 2.008737E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.336 | TFLOPs: 57.48 | 7: iteration 24660/ 44073 | consumed samples: 12625920 | consumed tokens: 25857884160 | elapsed time per iteration (s): 4.16 | learning rate: 9.451E-05 | global batch size: 512 | lm loss: 2.007868E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.996 | TFLOPs: 57.32 | 7: iteration 24670/ 44073 | consumed samples: 12631040 | consumed tokens: 25868369920 | elapsed time per iteration (s): 4.16 | learning rate: 9.444E-05 | global batch size: 512 | lm loss: 1.999116E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.042 | TFLOPs: 57.34 | 7: iteration 24680/ 44073 | consumed samples: 12636160 | consumed tokens: 25878855680 | elapsed time per iteration (s): 4.13 | learning rate: 9.438E-05 | global batch size: 512 | lm loss: 2.006391E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.939 | TFLOPs: 57.76 | 7: iteration 24690/ 44073 | consumed samples: 12641280 | consumed tokens: 25889341440 | elapsed time per iteration (s): 4.15 
| learning rate: 9.432E-05 | global batch size: 512 | lm loss: 2.012516E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.350 | TFLOPs: 57.49 | 7: iteration 24700/ 44073 | consumed samples: 12646400 | consumed tokens: 25899827200 | elapsed time per iteration (s): 4.16 | learning rate: 9.425E-05 | global batch size: 512 | lm loss: 2.006123E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.096 | TFLOPs: 57.37 | 7: iteration 24710/ 44073 | consumed samples: 12651520 | consumed tokens: 25910312960 | elapsed time per iteration (s): 4.16 | learning rate: 9.419E-05 | global batch size: 512 | lm loss: 1.987588E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.116 | TFLOPs: 57.38 | 7: iteration 24720/ 44073 | consumed samples: 12656640 | consumed tokens: 25920798720 | elapsed time per iteration (s): 4.16 | learning rate: 9.412E-05 | global batch size: 512 | lm loss: 2.027436E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.945 | TFLOPs: 57.30 | 7: iteration 24730/ 44073 | consumed samples: 12661760 | consumed tokens: 25931284480 | elapsed time per iteration (s): 4.15 | learning rate: 9.406E-05 | global batch size: 512 | lm loss: 1.995171E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.480 | TFLOPs: 57.55 | 7: iteration 24740/ 44073 | consumed samples: 12666880 | consumed tokens: 25941770240 | elapsed time per iteration (s): 4.15 | learning rate: 9.400E-05 | global batch size: 512 | lm loss: 2.000046E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 24750/ 44073 | 
consumed samples: 12672000 | consumed tokens: 25952256000 | elapsed time per iteration (s): 4.15 | learning rate: 9.393E-05 | global batch size: 512 | lm loss: 2.011498E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.322 | TFLOPs: 57.47 | 7: iteration 24760/ 44073 | consumed samples: 12677120 | consumed tokens: 25962741760 | elapsed time per iteration (s): 4.15 | learning rate: 9.387E-05 | global batch size: 512 | lm loss: 2.010203E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.491 | TFLOPs: 57.55 | 7: iteration 24770/ 44073 | consumed samples: 12682240 | consumed tokens: 25973227520 | elapsed time per iteration (s): 4.14 | learning rate: 9.381E-05 | global batch size: 512 | lm loss: 2.004156E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.532 | TFLOPs: 57.57 | 7: iteration 24780/ 44073 | consumed samples: 12687360 | consumed tokens: 25983713280 | elapsed time per iteration (s): 4.15 | learning rate: 9.374E-05 | global batch size: 512 | lm loss: 2.009705E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.452 | TFLOPs: 57.53 | 7: iteration 24790/ 44073 | consumed samples: 12692480 | consumed tokens: 25994199040 | elapsed time per iteration (s): 4.16 | learning rate: 9.368E-05 | global batch size: 512 | lm loss: 1.997570E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.934 | TFLOPs: 57.29 | 7: iteration 24800/ 44073 | consumed samples: 12697600 | consumed tokens: 26004684800 | elapsed time per iteration (s): 4.20 | learning rate: 9.361E-05 | global batch size: 512 | lm loss: 1.999430E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 122.007 | TFLOPs: 56.86 | 7: iteration 24810/ 44073 | consumed samples: 12702720 | consumed tokens: 26015170560 | elapsed time per iteration (s): 4.26 | learning rate: 9.355E-05 | global batch size: 512 | lm loss: 1.988673E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.157 | TFLOPs: 56.00 | 7: iteration 24820/ 44073 | consumed samples: 12707840 | consumed tokens: 26025656320 | elapsed time per iteration (s): 4.20 | learning rate: 9.349E-05 | global batch size: 512 | lm loss: 2.009148E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.998 | TFLOPs: 56.86 | 7: iteration 24830/ 44073 | consumed samples: 12712960 | consumed tokens: 26036142080 | elapsed time per iteration (s): 4.19 | learning rate: 9.342E-05 | global batch size: 512 | lm loss: 2.020293E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.096 | TFLOPs: 56.90 | 7: iteration 24840/ 44073 | consumed samples: 12718080 | consumed tokens: 26046627840 | elapsed time per iteration (s): 4.29 | learning rate: 9.336E-05 | global batch size: 512 | lm loss: 2.004607E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.453 | TFLOPs: 55.67 | 7: iteration 24850/ 44073 | consumed samples: 12723200 | consumed tokens: 26057113600 | elapsed time per iteration (s): 4.16 | learning rate: 9.330E-05 | global batch size: 512 | lm loss: 1.993741E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.206 | TFLOPs: 57.42 | 7: iteration 24860/ 44073 | consumed samples: 12728320 | consumed tokens: 26067599360 | elapsed time per iteration (s): 4.16 | learning rate: 9.323E-05 | global batch size: 512 | lm loss: 
2.006676E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 | 7: iteration 24870/ 44073 | consumed samples: 12733440 | consumed tokens: 26078085120 | elapsed time per iteration (s): 4.17 | learning rate: 9.317E-05 | global batch size: 512 | lm loss: 1.969060E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.818 | TFLOPs: 57.24 | 7: iteration 24880/ 44073 | consumed samples: 12738560 | consumed tokens: 26088570880 | elapsed time per iteration (s): 4.16 | learning rate: 9.310E-05 | global batch size: 512 | lm loss: 2.002629E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.976 | TFLOPs: 57.31 | 7: iteration 24890/ 44073 | consumed samples: 12743680 | consumed tokens: 26099056640 | elapsed time per iteration (s): 4.18 | learning rate: 9.304E-05 | global batch size: 512 | lm loss: 1.996498E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.599 | TFLOPs: 57.14 | 7: iteration 24900/ 44073 | consumed samples: 12748800 | consumed tokens: 26109542400 | elapsed time per iteration (s): 4.14 | learning rate: 9.298E-05 | global batch size: 512 | lm loss: 2.004327E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.558 | TFLOPs: 57.58 | 7: iteration 24910/ 44073 | consumed samples: 12753920 | consumed tokens: 26120028160 | elapsed time per iteration (s): 4.15 | learning rate: 9.291E-05 | global batch size: 512 | lm loss: 2.004261E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.414 | TFLOPs: 57.52 | 7: iteration 24920/ 44073 | consumed samples: 12759040 | consumed tokens: 26130513920 | 
elapsed time per iteration (s): 4.15 | learning rate: 9.285E-05 | global batch size: 512 | lm loss: 2.005166E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.244 | TFLOPs: 57.44 | 7: iteration 24930/ 44073 | consumed samples: 12764160 | consumed tokens: 26140999680 | elapsed time per iteration (s): 4.14 | learning rate: 9.279E-05 | global batch size: 512 | lm loss: 1.990764E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.637 | TFLOPs: 57.62 | 7: iteration 24940/ 44073 | consumed samples: 12769280 | consumed tokens: 26151485440 | elapsed time per iteration (s): 4.33 | learning rate: 9.272E-05 | global batch size: 512 | lm loss: 2.004743E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.260 | TFLOPs: 55.11 | 7: iteration 24950/ 44073 | consumed samples: 12774400 | consumed tokens: 26161971200 | elapsed time per iteration (s): 4.18 | learning rate: 9.266E-05 | global batch size: 512 | lm loss: 1.987505E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.380 | TFLOPs: 57.04 | 7: iteration 24960/ 44073 | consumed samples: 12779520 | consumed tokens: 26172456960 | elapsed time per iteration (s): 4.16 | learning rate: 9.260E-05 | global batch size: 512 | lm loss: 2.006069E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.186 | TFLOPs: 57.41 | 7: iteration 24970/ 44073 | consumed samples: 12784640 | consumed tokens: 26182942720 | elapsed time per iteration (s): 4.14 | learning rate: 9.253E-05 | global batch size: 512 | lm loss: 1.981139E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.547 | TFLOPs: 
57.58 | 7: iteration 24980/ 44073 | consumed samples: 12789760 | consumed tokens: 26193428480 | elapsed time per iteration (s): 4.15 | learning rate: 9.247E-05 | global batch size: 512 | lm loss: 2.018821E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.479 | TFLOPs: 57.55 | 7: iteration 24990/ 44073 | consumed samples: 12794880 | consumed tokens: 26203914240 | elapsed time per iteration (s): 4.15 | learning rate: 9.241E-05 | global batch size: 512 | lm loss: 2.000249E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.288 | TFLOPs: 57.46 | 7: iteration 25000/ 44073 | consumed samples: 12800000 | consumed tokens: 26214400000 | elapsed time per iteration (s): 4.16 | learning rate: 9.234E-05 | global batch size: 512 | lm loss: 1.996266E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.094 | TFLOPs: 57.37 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 25000 | lm loss value: 1.964657E+00 | lm loss PPL: 7.132465E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 25000 to checkpoints_2b2 0: [2022-11-26 15:45:23,803] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step25000 is begin to save! 0: [2022-11-26 15:45:23,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:45:24,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:45:24,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 15:45:24,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:45:24,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:45:24,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:45:24,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:45:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:45:24,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:45:24,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:45:24,766] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:45:24,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:45:24,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:45:25,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:45:25,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 15:45:25,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:45:25,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:45:25,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:45:25,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:45:25,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:45:25,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:45:25,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:45:25,589] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:45:25,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:45:25,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:45:25,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:45:25,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 15:45:26,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:45:26,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:45:26,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:45:26,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:45:26,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:45:26,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:45:26,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:45:26,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:45:26,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:45:26,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:45:26,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:45:26,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 15:45:26,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:45:26,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:45:27,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:45:27,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:45:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:45:27,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:45:27,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:45:27,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:45:27,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:45:27,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:45:27,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:45:27,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 15:45:27,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:45:27,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:45:27,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:45:27,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_29-model_00-model_states.pt... 0: [2022-11-26 15:45:28,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_29-model_00-model_states.pt. 0: [2022-11-26 15:45:28,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:45:28,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:45:28,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_31-model_00-model_states.pt... 0: [2022-11-26 15:45:28,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_31-model_00-model_states.pt. 0: [2022-11-26 15:45:28,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_32-model_00-model_states.pt... 0: [2022-11-26 15:45:28,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_32-model_00-model_states.pt. 0: [2022-11-26 15:45:28,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 15:45:28,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_33-model_00-model_states.pt. 0: [2022-11-26 15:45:28,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_34-model_00-model_states.pt... 0: [2022-11-26 15:45:28,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_34-model_00-model_states.pt. 0: [2022-11-26 15:45:28,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/layer_36-model_00-model_states.pt... 0: [2022-11-26 15:45:28,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/layer_36-model_00-model_states.pt. 0: [2022-11-26 15:45:28,713] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step25000/mp_rank_00_model_states.pt 0: [2022-11-26 15:45:28,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:45:28,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:45:28,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:45:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:45:29,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:45:29,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:45:29,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 15:45:29,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:45:29,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:45:29,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:45:29,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:45:29,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:45:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 15:45:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:45:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:45:29,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:45:29,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:45:29,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:45:29,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:45:29,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:45:29,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:45:29,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:45:29,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 15:45:29,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:45:29,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:45:29,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:45:29,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:45:29,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 15:45:29,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:45:29,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:45:29,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:45:29,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:45:29,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:45:29,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:45:29,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:45:30,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 15:45:30,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:45:30,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
0: successfully saved checkpoint at iteration 25000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6309.16 7: iteration 25010/ 44073 | consumed samples: 12805120 | consumed tokens: 26224885760 | elapsed time per iteration (s): 4.91 | learning rate: 9.228E-05 | global batch size: 512 | lm loss: 2.004218E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.364 | TFLOPs: 48.64 | 7: iteration 25020/ 44073 | consumed samples: 12810240 | consumed tokens: 26235371520 | elapsed time per iteration (s): 4.14 | learning rate: 9.221E-05 | global batch size: 512 | lm loss: 1.984560E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.670 | TFLOPs: 57.64 | 7: iteration 25030/ 44073 | consumed samples: 12815360 | consumed tokens: 26245857280 | elapsed time per iteration (s): 4.17 | learning rate: 9.215E-05 | global batch size: 512 | lm loss: 1.995938E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.859 | TFLOPs: 57.26 | 7: iteration 25040/ 44073 | consumed samples: 12820480 | consumed tokens: 26256343040 | elapsed time per iteration (s): 4.19 | learning rate: 9.209E-05 | global batch size: 512 | lm loss: 1.983896E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.144 | TFLOPs: 56.93 | 7: iteration 25050/ 44073 | consumed samples: 12825600 | consumed tokens: 26266828800 | elapsed time per iteration (s): 4.31 | learning rate: 9.202E-05 | global batch size: 512 | lm loss: 2.021547E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.679 | TFLOPs: 55.31 | 7: iteration 25060/ 44073 | consumed samples: 12830720 | consumed tokens: 26277314560 | elapsed time per iteration (s): 4.18 | learning rate: 
9.196E-05 | global batch size: 512 | lm loss: 1.974327E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.595 | TFLOPs: 57.14 | 7: iteration 25070/ 44073 | consumed samples: 12835840 | consumed tokens: 26287800320 | elapsed time per iteration (s): 4.19 | learning rate: 9.190E-05 | global batch size: 512 | lm loss: 2.005906E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.239 | TFLOPs: 56.97 | 7: iteration 25080/ 44073 | consumed samples: 12840960 | consumed tokens: 26298286080 | elapsed time per iteration (s): 4.14 | learning rate: 9.183E-05 | global batch size: 512 | lm loss: 1.997885E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.589 | TFLOPs: 57.60 | 7: iteration 25090/ 44073 | consumed samples: 12846080 | consumed tokens: 26308771840 | elapsed time per iteration (s): 4.15 | learning rate: 9.177E-05 | global batch size: 512 | lm loss: 1.978634E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.495 | TFLOPs: 57.56 | 7: iteration 25100/ 44073 | consumed samples: 12851200 | consumed tokens: 26319257600 | elapsed time per iteration (s): 4.19 | learning rate: 9.171E-05 | global batch size: 512 | lm loss: 1.986139E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.205 | TFLOPs: 56.95 | 7: iteration 25110/ 44073 | consumed samples: 12856320 | consumed tokens: 26329743360 | elapsed time per iteration (s): 4.32 | learning rate: 9.164E-05 | global batch size: 512 | lm loss: 1.987820E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.436 | TFLOPs: 55.20 | 7: iteration 25120/ 44073 | consumed samples: 
12861440 | consumed tokens: 26340229120 | elapsed time per iteration (s): 4.16 | learning rate: 9.158E-05 | global batch size: 512 | lm loss: 2.002816E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.973 | TFLOPs: 57.31 | 7: iteration 25130/ 44073 | consumed samples: 12866560 | consumed tokens: 26350714880 | elapsed time per iteration (s): 4.16 | learning rate: 9.152E-05 | global batch size: 512 | lm loss: 1.991603E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.196 | TFLOPs: 57.42 | 7: iteration 25140/ 44073 | consumed samples: 12871680 | consumed tokens: 26361200640 | elapsed time per iteration (s): 4.17 | learning rate: 9.145E-05 | global batch size: 512 | lm loss: 2.004316E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.647 | TFLOPs: 57.16 | 7: iteration 25150/ 44073 | consumed samples: 12876800 | consumed tokens: 26371686400 | elapsed time per iteration (s): 4.15 | learning rate: 9.139E-05 | global batch size: 512 | lm loss: 1.982751E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.248 | TFLOPs: 57.44 | 7: iteration 25160/ 44073 | consumed samples: 12881920 | consumed tokens: 26382172160 | elapsed time per iteration (s): 4.24 | learning rate: 9.133E-05 | global batch size: 512 | lm loss: 2.002476E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.831 | TFLOPs: 56.31 | 7: iteration 25170/ 44073 | consumed samples: 12887040 | consumed tokens: 26392657920 | elapsed time per iteration (s): 4.15 | learning rate: 9.126E-05 | global batch size: 512 | lm loss: 1.976705E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.231 | TFLOPs: 57.43 | 7: iteration 25180/ 44073 | consumed samples: 12892160 | consumed tokens: 26403143680 | elapsed time per iteration (s): 4.13 | learning rate: 9.120E-05 | global batch size: 512 | lm loss: 2.001797E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.880 | TFLOPs: 57.73 | 7: iteration 25190/ 44073 | consumed samples: 12897280 | consumed tokens: 26413629440 | elapsed time per iteration (s): 4.15 | learning rate: 9.114E-05 | global batch size: 512 | lm loss: 2.015508E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.338 | TFLOPs: 57.48 | 7: iteration 25200/ 44073 | consumed samples: 12902400 | consumed tokens: 26424115200 | elapsed time per iteration (s): 4.16 | learning rate: 9.107E-05 | global batch size: 512 | lm loss: 1.998103E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.180 | TFLOPs: 57.41 | 7: iteration 25210/ 44073 | consumed samples: 12907520 | consumed tokens: 26434600960 | elapsed time per iteration (s): 4.17 | learning rate: 9.101E-05 | global batch size: 512 | lm loss: 2.003011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.856 | TFLOPs: 57.26 | 7: iteration 25220/ 44073 | consumed samples: 12912640 | consumed tokens: 26445086720 | elapsed time per iteration (s): 4.21 | learning rate: 9.095E-05 | global batch size: 512 | lm loss: 2.003445E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.720 | TFLOPs: 56.73 | 7: iteration 25230/ 44073 | consumed samples: 12917760 | consumed tokens: 26455572480 | elapsed time per iteration (s): 4.16 | learning rate: 9.088E-05 | global batch size: 512 | lm loss: 2.002209E+00 | 
grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.000 | TFLOPs: 57.32 | 7: iteration 25240/ 44073 | consumed samples: 12922880 | consumed tokens: 26466058240 | elapsed time per iteration (s): 4.18 | learning rate: 9.082E-05 | global batch size: 512 | lm loss: 2.019490E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.425 | TFLOPs: 57.06 | 7: iteration 25250/ 44073 | consumed samples: 12928000 | consumed tokens: 26476544000 | elapsed time per iteration (s): 4.19 | learning rate: 9.076E-05 | global batch size: 512 | lm loss: 1.982578E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.287 | TFLOPs: 56.99 | 7: iteration 25260/ 44073 | consumed samples: 12933120 | consumed tokens: 26487029760 | elapsed time per iteration (s): 4.27 | learning rate: 9.069E-05 | global batch size: 512 | lm loss: 2.001502E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.024 | TFLOPs: 55.94 | 7: iteration 25270/ 44073 | consumed samples: 12938240 | consumed tokens: 26497515520 | elapsed time per iteration (s): 4.17 | learning rate: 9.063E-05 | global batch size: 512 | lm loss: 2.002568E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.854 | TFLOPs: 57.26 | 7: iteration 25280/ 44073 | consumed samples: 12943360 | consumed tokens: 26508001280 | elapsed time per iteration (s): 4.23 | learning rate: 9.057E-05 | global batch size: 512 | lm loss: 1.988152E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.139 | TFLOPs: 56.46 | 7: iteration 25290/ 44073 | consumed samples: 12948480 | consumed tokens: 26518487040 | elapsed time per 
iteration (s): 4.16 | learning rate: 9.050E-05 | global batch size: 512 | lm loss: 1.995567E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.086 | TFLOPs: 57.36 | 7: iteration 25300/ 44073 | consumed samples: 12953600 | consumed tokens: 26528972800 | elapsed time per iteration (s): 4.21 | learning rate: 9.044E-05 | global batch size: 512 | lm loss: 2.013054E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.497 | TFLOPs: 56.62 | 7: iteration 25310/ 44073 | consumed samples: 12958720 | consumed tokens: 26539458560 | elapsed time per iteration (s): 4.16 | learning rate: 9.038E-05 | global batch size: 512 | lm loss: 1.993833E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.209 | TFLOPs: 57.42 | 7: iteration 25320/ 44073 | consumed samples: 12963840 | consumed tokens: 26549944320 | elapsed time per iteration (s): 4.17 | learning rate: 9.031E-05 | global batch size: 512 | lm loss: 2.004768E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.751 | TFLOPs: 57.21 | 7: iteration 25330/ 44073 | consumed samples: 12968960 | consumed tokens: 26560430080 | elapsed time per iteration (s): 4.32 | learning rate: 9.025E-05 | global batch size: 512 | lm loss: 1.994082E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.470 | TFLOPs: 55.21 | 7: iteration 25340/ 44073 | consumed samples: 12974080 | consumed tokens: 26570915840 | elapsed time per iteration (s): 4.16 | learning rate: 9.019E-05 | global batch size: 512 | lm loss: 1.991929E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.191 | TFLOPs: 57.41 | 7: 
iteration 25350/ 44073 | consumed samples: 12979200 | consumed tokens: 26581401600 | elapsed time per iteration (s): 4.15 | learning rate: 9.012E-05 | global batch size: 512 | lm loss: 1.975090E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.232 | TFLOPs: 57.43 | 7: iteration 25360/ 44073 | consumed samples: 12984320 | consumed tokens: 26591887360 | elapsed time per iteration (s): 4.16 | learning rate: 9.006E-05 | global batch size: 512 | lm loss: 1.996858E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.937 | TFLOPs: 57.29 | 7: iteration 25370/ 44073 | consumed samples: 12989440 | consumed tokens: 26602373120 | elapsed time per iteration (s): 4.15 | learning rate: 9.000E-05 | global batch size: 512 | lm loss: 1.987194E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.451 | TFLOPs: 57.53 | 7: iteration 25380/ 44073 | consumed samples: 12994560 | consumed tokens: 26612858880 | elapsed time per iteration (s): 4.18 | learning rate: 8.993E-05 | global batch size: 512 | lm loss: 1.995579E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.489 | TFLOPs: 57.09 | 7: iteration 25390/ 44073 | consumed samples: 12999680 | consumed tokens: 26623344640 | elapsed time per iteration (s): 4.15 | learning rate: 8.987E-05 | global batch size: 512 | lm loss: 2.004430E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.273 | TFLOPs: 57.45 | 7: iteration 25400/ 44073 | consumed samples: 13004800 | consumed tokens: 26633830400 | elapsed time per iteration (s): 4.19 | learning rate: 8.981E-05 | global batch size: 512 | lm loss: 1.998153E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 122.126 | TFLOPs: 56.92 | 7: iteration 25410/ 44073 | consumed samples: 13009920 | consumed tokens: 26644316160 | elapsed time per iteration (s): 4.17 | learning rate: 8.974E-05 | global batch size: 512 | lm loss: 1.998615E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.876 | TFLOPs: 57.27 | 7: iteration 25420/ 44073 | consumed samples: 13015040 | consumed tokens: 26654801920 | elapsed time per iteration (s): 4.15 | learning rate: 8.968E-05 | global batch size: 512 | lm loss: 2.002123E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.397 | TFLOPs: 57.51 | 7: iteration 25430/ 44073 | consumed samples: 13020160 | consumed tokens: 26665287680 | elapsed time per iteration (s): 4.17 | learning rate: 8.962E-05 | global batch size: 512 | lm loss: 1.998878E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.750 | TFLOPs: 57.21 | 7: iteration 25440/ 44073 | consumed samples: 13025280 | consumed tokens: 26675773440 | elapsed time per iteration (s): 4.19 | learning rate: 8.956E-05 | global batch size: 512 | lm loss: 1.991406E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.254 | TFLOPs: 56.98 | 7: iteration 25450/ 44073 | consumed samples: 13030400 | consumed tokens: 26686259200 | elapsed time per iteration (s): 4.15 | learning rate: 8.949E-05 | global batch size: 512 | lm loss: 1.961196E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.248 | TFLOPs: 57.44 | 7: iteration 25460/ 44073 | consumed samples: 13035520 | consumed tokens: 26696744960 | elapsed time per iteration (s): 4.16 | learning rate: 8.943E-05 | global 
batch size: 512 | lm loss: 1.987811E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.097 | TFLOPs: 57.37 | 7: iteration 25470/ 44073 | consumed samples: 13040640 | consumed tokens: 26707230720 | elapsed time per iteration (s): 4.18 | learning rate: 8.937E-05 | global batch size: 512 | lm loss: 2.021501E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.364 | TFLOPs: 57.03 | 7: iteration 25480/ 44073 | consumed samples: 13045760 | consumed tokens: 26717716480 | elapsed time per iteration (s): 4.15 | learning rate: 8.930E-05 | global batch size: 512 | lm loss: 1.990093E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.418 | TFLOPs: 57.52 | 7: iteration 25490/ 44073 | consumed samples: 13050880 | consumed tokens: 26728202240 | elapsed time per iteration (s): 4.31 | learning rate: 8.924E-05 | global batch size: 512 | lm loss: 2.017212E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.802 | TFLOPs: 55.37 | 7: iteration 25500/ 44073 | consumed samples: 13056000 | consumed tokens: 26738688000 | elapsed time per iteration (s): 4.17 | learning rate: 8.918E-05 | global batch size: 512 | lm loss: 1.977269E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.795 | TFLOPs: 57.23 | 7: iteration 25510/ 44073 | consumed samples: 13061120 | consumed tokens: 26749173760 | elapsed time per iteration (s): 4.16 | learning rate: 8.911E-05 | global batch size: 512 | lm loss: 2.009645E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.082 | TFLOPs: 57.36 | 7: iteration 25520/ 44073 | consumed samples: 13066240 | consumed 
tokens: 26759659520 | elapsed time per iteration (s): 4.14 | learning rate: 8.905E-05 | global batch size: 512 | lm loss: 1.989320E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.524 | TFLOPs: 57.57 | 7: iteration 25530/ 44073 | consumed samples: 13071360 | consumed tokens: 26770145280 | elapsed time per iteration (s): 4.14 | learning rate: 8.899E-05 | global batch size: 512 | lm loss: 1.983665E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.705 | TFLOPs: 57.65 | 7: iteration 25540/ 44073 | consumed samples: 13076480 | consumed tokens: 26780631040 | elapsed time per iteration (s): 4.15 | learning rate: 8.892E-05 | global batch size: 512 | lm loss: 1.998265E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.332 | TFLOPs: 57.48 | 7: iteration 25550/ 44073 | consumed samples: 13081600 | consumed tokens: 26791116800 | elapsed time per iteration (s): 4.29 | learning rate: 8.886E-05 | global batch size: 512 | lm loss: 1.985726E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.443 | TFLOPs: 55.67 | 7: iteration 25560/ 44073 | consumed samples: 13086720 | consumed tokens: 26801602560 | elapsed time per iteration (s): 4.23 | learning rate: 8.880E-05 | global batch size: 512 | lm loss: 1.998014E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.942 | TFLOPs: 56.37 | 7: iteration 25570/ 44073 | consumed samples: 13091840 | consumed tokens: 26812088320 | elapsed time per iteration (s): 4.21 | learning rate: 8.874E-05 | global batch size: 512 | lm loss: 1.996746E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 121.611 | TFLOPs: 56.68 | 7: iteration 25580/ 44073 | consumed samples: 13096960 | consumed tokens: 26822574080 | elapsed time per iteration (s): 4.19 | learning rate: 8.867E-05 | global batch size: 512 | lm loss: 2.028022E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.134 | TFLOPs: 56.92 | 7: iteration 25590/ 44073 | consumed samples: 13102080 | consumed tokens: 26833059840 | elapsed time per iteration (s): 4.19 | learning rate: 8.861E-05 | global batch size: 512 | lm loss: 1.977272E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.306 | TFLOPs: 57.00 | 7: iteration 25600/ 44073 | consumed samples: 13107200 | consumed tokens: 26843545600 | elapsed time per iteration (s): 4.14 | learning rate: 8.855E-05 | global batch size: 512 | lm loss: 1.988007E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.769 | TFLOPs: 57.68 | 7: iteration 25610/ 44073 | consumed samples: 13112320 | consumed tokens: 26854031360 | elapsed time per iteration (s): 4.15 | learning rate: 8.848E-05 | global batch size: 512 | lm loss: 1.992309E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.399 | TFLOPs: 57.51 | 7: iteration 25620/ 44073 | consumed samples: 13117440 | consumed tokens: 26864517120 | elapsed time per iteration (s): 4.14 | learning rate: 8.842E-05 | global batch size: 512 | lm loss: 1.979029E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.562 | TFLOPs: 57.59 | 7: iteration 25630/ 44073 | consumed samples: 13122560 | consumed tokens: 26875002880 | elapsed time per iteration (s): 4.17 | learning rate: 8.836E-05 | global batch size: 512 | lm loss: 1.981558E+00 | grad norm: 0.124 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.723 | TFLOPs: 57.20 | 7: iteration 25640/ 44073 | consumed samples: 13127680 | consumed tokens: 26885488640 | elapsed time per iteration (s): 4.16 | learning rate: 8.830E-05 | global batch size: 512 | lm loss: 1.975590E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.939 | TFLOPs: 57.30 | 7: iteration 25650/ 44073 | consumed samples: 13132800 | consumed tokens: 26895974400 | elapsed time per iteration (s): 4.14 | learning rate: 8.823E-05 | global batch size: 512 | lm loss: 2.008541E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.543 | TFLOPs: 57.58 | 7: iteration 25660/ 44073 | consumed samples: 13137920 | consumed tokens: 26906460160 | elapsed time per iteration (s): 4.17 | learning rate: 8.817E-05 | global batch size: 512 | lm loss: 1.988708E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.918 | TFLOPs: 57.29 | 7: iteration 25670/ 44073 | consumed samples: 13143040 | consumed tokens: 26916945920 | elapsed time per iteration (s): 4.14 | learning rate: 8.811E-05 | global batch size: 512 | lm loss: 2.015025E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.774 | TFLOPs: 57.69 | 7: iteration 25680/ 44073 | consumed samples: 13148160 | consumed tokens: 26927431680 | elapsed time per iteration (s): 4.15 | learning rate: 8.804E-05 | global batch size: 512 | lm loss: 1.985392E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.260 | TFLOPs: 57.45 | 7: iteration 25690/ 44073 | consumed samples: 13153280 | consumed tokens: 26937917440 | elapsed time per iteration (s): 4.14 
| learning rate: 8.798E-05 | global batch size: 512 | lm loss: 1.988287E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.569 | TFLOPs: 57.59 | 7: iteration 25700/ 44073 | consumed samples: 13158400 | consumed tokens: 26948403200 | elapsed time per iteration (s): 4.14 | learning rate: 8.792E-05 | global batch size: 512 | lm loss: 1.994277E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.578 | TFLOPs: 57.59 | 7: iteration 25710/ 44073 | consumed samples: 13163520 | consumed tokens: 26958888960 | elapsed time per iteration (s): 4.17 | learning rate: 8.786E-05 | global batch size: 512 | lm loss: 1.990557E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.913 | TFLOPs: 57.28 | 7: iteration 25720/ 44073 | consumed samples: 13168640 | consumed tokens: 26969374720 | elapsed time per iteration (s): 4.15 | learning rate: 8.779E-05 | global batch size: 512 | lm loss: 1.996274E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.490 | TFLOPs: 57.55 | 7: iteration 25730/ 44073 | consumed samples: 13173760 | consumed tokens: 26979860480 | elapsed time per iteration (s): 4.13 | learning rate: 8.773E-05 | global batch size: 512 | lm loss: 1.974293E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.984 | TFLOPs: 57.78 | 7: iteration 25740/ 44073 | consumed samples: 13178880 | consumed tokens: 26990346240 | elapsed time per iteration (s): 4.22 | learning rate: 8.767E-05 | global batch size: 512 | lm loss: 1.983058E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.184 | TFLOPs: 56.48 | 7: iteration 25750/ 44073 | 
consumed samples: 13184000 | consumed tokens: 27000832000 | elapsed time per iteration (s): 4.14 | learning rate: 8.760E-05 | global batch size: 512 | lm loss: 1.978273E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.614 | TFLOPs: 57.61 | 7: iteration 25760/ 44073 | consumed samples: 13189120 | consumed tokens: 27011317760 | elapsed time per iteration (s): 4.15 | learning rate: 8.754E-05 | global batch size: 512 | lm loss: 1.977457E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.380 | TFLOPs: 57.50 | 7: iteration 25770/ 44073 | consumed samples: 13194240 | consumed tokens: 27021803520 | elapsed time per iteration (s): 4.13 | learning rate: 8.748E-05 | global batch size: 512 | lm loss: 1.988729E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.934 | TFLOPs: 57.76 | 7: iteration 25780/ 44073 | consumed samples: 13199360 | consumed tokens: 27032289280 | elapsed time per iteration (s): 4.15 | learning rate: 8.742E-05 | global batch size: 512 | lm loss: 1.998874E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.407 | TFLOPs: 57.51 | 7: iteration 25790/ 44073 | consumed samples: 13204480 | consumed tokens: 27042775040 | elapsed time per iteration (s): 4.13 | learning rate: 8.735E-05 | global batch size: 512 | lm loss: 1.990283E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.823 | TFLOPs: 57.71 | 7: iteration 25800/ 44073 | consumed samples: 13209600 | consumed tokens: 27053260800 | elapsed time per iteration (s): 4.16 | learning rate: 8.729E-05 | global batch size: 512 | lm loss: 1.974194E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 122.952 | TFLOPs: 57.30 | 7: iteration 25810/ 44073 | consumed samples: 13214720 | consumed tokens: 27063746560 | elapsed time per iteration (s): 4.14 | learning rate: 8.723E-05 | global batch size: 512 | lm loss: 2.002520E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.820 | TFLOPs: 57.71 | 7: iteration 25820/ 44073 | consumed samples: 13219840 | consumed tokens: 27074232320 | elapsed time per iteration (s): 4.16 | learning rate: 8.717E-05 | global batch size: 512 | lm loss: 1.983490E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.936 | TFLOPs: 57.29 | 7: iteration 25830/ 44073 | consumed samples: 13224960 | consumed tokens: 27084718080 | elapsed time per iteration (s): 4.20 | learning rate: 8.710E-05 | global batch size: 512 | lm loss: 1.991068E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.997 | TFLOPs: 56.86 | 7: iteration 25840/ 44073 | consumed samples: 13230080 | consumed tokens: 27095203840 | elapsed time per iteration (s): 4.19 | learning rate: 8.704E-05 | global batch size: 512 | lm loss: 1.963569E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.270 | TFLOPs: 56.98 | 7: iteration 25850/ 44073 | consumed samples: 13235200 | consumed tokens: 27105689600 | elapsed time per iteration (s): 4.16 | learning rate: 8.698E-05 | global batch size: 512 | lm loss: 1.983482E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.986 | TFLOPs: 57.32 | 7: iteration 25860/ 44073 | consumed samples: 13240320 | consumed tokens: 27116175360 | elapsed time per iteration (s): 4.17 | learning rate: 8.691E-05 | global batch size: 512 | lm loss: 
1.982455E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.660 | TFLOPs: 57.17 | 7: iteration 25870/ 44073 | consumed samples: 13245440 | consumed tokens: 27126661120 | elapsed time per iteration (s): 4.18 | learning rate: 8.685E-05 | global batch size: 512 | lm loss: 1.976569E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.595 | TFLOPs: 57.14 | 7: iteration 25880/ 44073 | consumed samples: 13250560 | consumed tokens: 27137146880 | elapsed time per iteration (s): 4.17 | learning rate: 8.679E-05 | global batch size: 512 | lm loss: 1.973408E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.900 | TFLOPs: 57.28 | 7: iteration 25890/ 44073 | consumed samples: 13255680 | consumed tokens: 27147632640 | elapsed time per iteration (s): 4.17 | learning rate: 8.673E-05 | global batch size: 512 | lm loss: 2.000709E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.880 | TFLOPs: 57.27 | 7: iteration 25900/ 44073 | consumed samples: 13260800 | consumed tokens: 27158118400 | elapsed time per iteration (s): 4.15 | learning rate: 8.666E-05 | global batch size: 512 | lm loss: 1.987470E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.346 | TFLOPs: 57.49 | 7: iteration 25910/ 44073 | consumed samples: 13265920 | consumed tokens: 27168604160 | elapsed time per iteration (s): 4.16 | learning rate: 8.660E-05 | global batch size: 512 | lm loss: 2.000211E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.077 | TFLOPs: 57.36 | 7: iteration 25920/ 44073 | consumed samples: 13271040 | consumed tokens: 27179089920 | 
elapsed time per iteration (s): 4.17 | learning rate: 8.654E-05 | global batch size: 512 | lm loss: 1.982645E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.801 | TFLOPs: 57.23 | 7: iteration 25930/ 44073 | consumed samples: 13276160 | consumed tokens: 27189575680 | elapsed time per iteration (s): 4.16 | learning rate: 8.648E-05 | global batch size: 512 | lm loss: 1.976389E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.216 | TFLOPs: 57.42 | 7: iteration 25940/ 44073 | consumed samples: 13281280 | consumed tokens: 27200061440 | elapsed time per iteration (s): 4.14 | learning rate: 8.641E-05 | global batch size: 512 | lm loss: 1.984595E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 25950/ 44073 | consumed samples: 13286400 | consumed tokens: 27210547200 | elapsed time per iteration (s): 4.24 | learning rate: 8.635E-05 | global batch size: 512 | lm loss: 1.977240E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.656 | TFLOPs: 56.23 | 7: iteration 25960/ 44073 | consumed samples: 13291520 | consumed tokens: 27221032960 | elapsed time per iteration (s): 4.14 | learning rate: 8.629E-05 | global batch size: 512 | lm loss: 1.967600E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.587 | TFLOPs: 57.60 | 7: iteration 25970/ 44073 | consumed samples: 13296640 | consumed tokens: 27231518720 | elapsed time per iteration (s): 4.16 | learning rate: 8.623E-05 | global batch size: 512 | lm loss: 1.977264E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.180 | TFLOPs: 
57.41 | 7: iteration 25980/ 44073 | consumed samples: 13301760 | consumed tokens: 27242004480 | elapsed time per iteration (s): 4.33 | learning rate: 8.616E-05 | global batch size: 512 | lm loss: 1.986643E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.178 | TFLOPs: 55.08 | 7: iteration 25990/ 44073 | consumed samples: 13306880 | consumed tokens: 27252490240 | elapsed time per iteration (s): 4.16 | learning rate: 8.610E-05 | global batch size: 512 | lm loss: 1.980042E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.967 | TFLOPs: 57.31 | 0: [2022-11-26 16:55:05,784] [INFO] [logging.py:68:log_dist] [Rank 0] step=26000, skipped=0, lr=[8.60388138259506e-05, 8.60388138259506e-05, 8.60388138259506e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 26000/ 44073 | consumed samples: 13312000 | consumed tokens: 27262976000 | elapsed time per iteration (s): 4.19 | learning rate: 8.604E-05 | global batch size: 512 | lm loss: 1.989435E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.143 | TFLOPs: 56.93 | 0: steps: 26000 loss: 1.9366 iter time (s): 4.174 samples/sec: 122.656 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 26000 | lm loss value: 1.931051E+00 | lm loss PPL: 6.896754E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 26000 to checkpoints_2b2 0: [2022-11-26 16:55:07,155] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step26000 is begin to save! 0: [2022-11-26 16:55:07,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 16:55:07,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:55:07,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:55:07,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:55:07,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:55:07,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:55:07,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:55:07,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:55:07,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:55:08,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:55:08,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:55:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_07-model_00-model_states.pt. 0: [2022-11-26 16:55:08,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 16:55:08,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:55:08,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:55:08,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:55:08,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:55:08,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:55:08,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:55:08,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:55:08,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:55:08,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:55:08,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:55:09,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_13-model_00-model_states.pt. 0: [2022-11-26 16:55:09,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 16:55:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:55:09,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:55:09,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:55:09,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:55:09,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:55:09,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:55:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:55:09,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:55:09,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:55:09,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:55:09,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_19-model_00-model_states.pt. 0: [2022-11-26 16:55:09,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 16:55:10,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:55:10,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:55:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:55:10,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:55:10,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:55:10,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:55:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:55:10,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:55:10,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:55:10,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:55:10,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_25-model_00-model_states.pt. 0: [2022-11-26 16:55:10,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 16:55:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:55:10,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:55:11,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:55:11,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:55:11,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:55:11,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_29-model_00-model_states.pt... 0: [2022-11-26 16:55:11,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_29-model_00-model_states.pt. 0: [2022-11-26 16:55:11,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:55:11,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:55:11,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_31-model_00-model_states.pt... 0: [2022-11-26 16:55:11,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_31-model_00-model_states.pt. 0: [2022-11-26 16:55:11,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 16:55:11,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_32-model_00-model_states.pt. 0: [2022-11-26 16:55:11,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_33-model_00-model_states.pt... 0: [2022-11-26 16:55:11,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_33-model_00-model_states.pt. 0: [2022-11-26 16:55:11,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_34-model_00-model_states.pt... 0: [2022-11-26 16:55:11,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_34-model_00-model_states.pt. 0: [2022-11-26 16:55:11,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/layer_36-model_00-model_states.pt... 0: [2022-11-26 16:55:11,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/layer_36-model_00-model_states.pt. 0: [2022-11-26 16:55:11,989] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step26000/mp_rank_00_model_states.pt 0: [2022-11-26 16:55:11,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:55:11,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:55:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:12,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:55:12,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:55:12,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:12,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:12,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:12,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:55:12,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:12,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:55:12,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
3: [2022-11-26 16:55:12,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 16:55:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 16:55:12,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:12,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:12,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 16:55:12,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:12,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 16:55:12,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:55:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:55:13,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 16:55:13,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:13,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:13,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 16:55:13,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 16:55:13,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:13,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 16:55:13,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:13,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
0: successfully saved checkpoint at iteration 26000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6102.45 7: iteration 26010/ 44073 | consumed samples: 13317120 | consumed tokens: 27273461760 | elapsed time per iteration (s): 4.88 | learning rate: 8.598E-05 | global batch size: 512 | lm loss: 1.975064E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.025 | TFLOPs: 48.95 | 7: iteration 26020/ 44073 | consumed samples: 13322240 | consumed tokens: 27283947520 | elapsed time per iteration (s): 4.18 | learning rate: 8.591E-05 | global batch size: 512 | lm loss: 1.962239E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.392 | TFLOPs: 57.04 | 7: iteration 26030/ 44073 | consumed samples: 13327360 | consumed tokens: 27294433280 | elapsed time per iteration (s): 4.13 | learning rate: 8.585E-05 | global batch size: 512 | lm loss: 1.971895E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.832 | TFLOPs: 57.71 | 7: iteration 26040/ 44073 | consumed samples: 13332480 | consumed tokens: 27304919040 | elapsed time per iteration (s): 4.16 | learning rate: 8.579E-05 | global batch size: 512 | lm loss: 1.981322E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.160 | TFLOPs: 57.40 | 7: iteration 26050/ 44073 | consumed samples: 13337600 | consumed tokens: 27315404800 | elapsed time per iteration (s): 4.29 | learning rate: 8.573E-05 | global batch size: 512 | lm loss: 1.966945E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.367 | TFLOPs: 55.63 | 7: iteration 26060/ 44073 | consumed samples: 13342720 | consumed tokens: 27325890560 | elapsed time per iteration (s): 4.17 | learning rate: 
8.566E-05 | global batch size: 512 | lm loss: 1.992301E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.828 | TFLOPs: 57.24 | 7: iteration 26070/ 44073 | consumed samples: 13347840 | consumed tokens: 27336376320 | elapsed time per iteration (s): 4.13 | learning rate: 8.560E-05 | global batch size: 512 | lm loss: 1.996100E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.953 | TFLOPs: 57.77 | 7: iteration 26080/ 44073 | consumed samples: 13352960 | consumed tokens: 27346862080 | elapsed time per iteration (s): 4.13 | learning rate: 8.554E-05 | global batch size: 512 | lm loss: 2.001755E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.843 | TFLOPs: 57.72 | 7: iteration 26090/ 44073 | consumed samples: 13358080 | consumed tokens: 27357347840 | elapsed time per iteration (s): 4.13 | learning rate: 8.548E-05 | global batch size: 512 | lm loss: 2.000854E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 124.014 | TFLOPs: 57.80 | 7: iteration 26100/ 44073 | consumed samples: 13363200 | consumed tokens: 27367833600 | elapsed time per iteration (s): 4.14 | learning rate: 8.541E-05 | global batch size: 512 | lm loss: 1.983892E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.765 | TFLOPs: 57.68 | 7: iteration 26110/ 44073 | consumed samples: 13368320 | consumed tokens: 27378319360 | elapsed time per iteration (s): 4.13 | learning rate: 8.535E-05 | global batch size: 512 | lm loss: 2.008453E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.856 | TFLOPs: 57.72 | 7: iteration 26120/ 44073 | consumed samples: 
13373440 | consumed tokens: 27388805120 | elapsed time per iteration (s): 4.17 | learning rate: 8.529E-05 | global batch size: 512 | lm loss: 1.968088E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.886 | TFLOPs: 57.27 | 7: iteration 26130/ 44073 | consumed samples: 13378560 | consumed tokens: 27399290880 | elapsed time per iteration (s): 4.15 | learning rate: 8.523E-05 | global batch size: 512 | lm loss: 1.983208E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.437 | TFLOPs: 57.53 | 7: iteration 26140/ 44073 | consumed samples: 13383680 | consumed tokens: 27409776640 | elapsed time per iteration (s): 4.17 | learning rate: 8.517E-05 | global batch size: 512 | lm loss: 1.998677E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.875 | TFLOPs: 57.27 | 7: iteration 26150/ 44073 | consumed samples: 13388800 | consumed tokens: 27420262400 | elapsed time per iteration (s): 4.14 | learning rate: 8.510E-05 | global batch size: 512 | lm loss: 1.999682E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.568 | TFLOPs: 57.59 | 7: iteration 26160/ 44073 | consumed samples: 13393920 | consumed tokens: 27430748160 | elapsed time per iteration (s): 4.15 | learning rate: 8.504E-05 | global batch size: 512 | lm loss: 1.983787E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.448 | TFLOPs: 57.53 | 7: iteration 26170/ 44073 | consumed samples: 13399040 | consumed tokens: 27441233920 | elapsed time per iteration (s): 4.13 | learning rate: 8.498E-05 | global batch size: 512 | lm loss: 1.981425E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.968 | TFLOPs: 57.78 | 7: iteration 26180/ 44073 | consumed samples: 13404160 | consumed tokens: 27451719680 | elapsed time per iteration (s): 4.15 | learning rate: 8.492E-05 | global batch size: 512 | lm loss: 1.967341E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 26190/ 44073 | consumed samples: 13409280 | consumed tokens: 27462205440 | elapsed time per iteration (s): 4.14 | learning rate: 8.485E-05 | global batch size: 512 | lm loss: 1.991452E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.604 | TFLOPs: 57.61 | 7: iteration 26200/ 44073 | consumed samples: 13414400 | consumed tokens: 27472691200 | elapsed time per iteration (s): 4.16 | learning rate: 8.479E-05 | global batch size: 512 | lm loss: 1.976195E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.154 | TFLOPs: 57.40 | 7: iteration 26210/ 44073 | consumed samples: 13419520 | consumed tokens: 27483176960 | elapsed time per iteration (s): 4.15 | learning rate: 8.473E-05 | global batch size: 512 | lm loss: 2.004479E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.317 | TFLOPs: 57.47 | 7: iteration 26220/ 44073 | consumed samples: 13424640 | consumed tokens: 27493662720 | elapsed time per iteration (s): 4.13 | learning rate: 8.467E-05 | global batch size: 512 | lm loss: 1.980326E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.854 | TFLOPs: 57.72 | 7: iteration 26230/ 44073 | consumed samples: 13429760 | consumed tokens: 27504148480 | elapsed time per iteration (s): 4.15 | learning rate: 8.461E-05 | global batch size: 512 | lm loss: 1.975476E+00 | 
grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.273 | TFLOPs: 57.45 | 7: iteration 26240/ 44073 | consumed samples: 13434880 | consumed tokens: 27514634240 | elapsed time per iteration (s): 4.15 | learning rate: 8.454E-05 | global batch size: 512 | lm loss: 1.989649E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.340 | TFLOPs: 57.48 | 7: iteration 26250/ 44073 | consumed samples: 13440000 | consumed tokens: 27525120000 | elapsed time per iteration (s): 4.15 | learning rate: 8.448E-05 | global batch size: 512 | lm loss: 1.973881E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.410 | TFLOPs: 57.52 | 7: iteration 26260/ 44073 | consumed samples: 13445120 | consumed tokens: 27535605760 | elapsed time per iteration (s): 4.13 | learning rate: 8.442E-05 | global batch size: 512 | lm loss: 1.960179E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.927 | TFLOPs: 57.76 | 7: iteration 26270/ 44073 | consumed samples: 13450240 | consumed tokens: 27546091520 | elapsed time per iteration (s): 4.30 | learning rate: 8.436E-05 | global batch size: 512 | lm loss: 1.976778E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.010 | TFLOPs: 55.46 | 7: iteration 26280/ 44073 | consumed samples: 13455360 | consumed tokens: 27556577280 | elapsed time per iteration (s): 4.13 | learning rate: 8.429E-05 | global batch size: 512 | lm loss: 1.987304E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.890 | TFLOPs: 57.74 | 7: iteration 26290/ 44073 | consumed samples: 13460480 | consumed tokens: 27567063040 | elapsed time per 
iteration (s): 4.14 | learning rate: 8.423E-05 | global batch size: 512 | lm loss: 1.978966E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.707 | TFLOPs: 57.65 | 7: iteration 26300/ 44073 | consumed samples: 13465600 | consumed tokens: 27577548800 | elapsed time per iteration (s): 4.14 | learning rate: 8.417E-05 | global batch size: 512 | lm loss: 1.985590E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.644 | TFLOPs: 57.62 | 7: iteration 26310/ 44073 | consumed samples: 13470720 | consumed tokens: 27588034560 | elapsed time per iteration (s): 4.13 | learning rate: 8.411E-05 | global batch size: 512 | lm loss: 1.967863E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.879 | TFLOPs: 57.73 | 7: iteration 26320/ 44073 | consumed samples: 13475840 | consumed tokens: 27598520320 | elapsed time per iteration (s): 4.13 | learning rate: 8.405E-05 | global batch size: 512 | lm loss: 1.992157E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.856 | TFLOPs: 57.72 | 7: iteration 26330/ 44073 | consumed samples: 13480960 | consumed tokens: 27609006080 | elapsed time per iteration (s): 4.13 | learning rate: 8.398E-05 | global batch size: 512 | lm loss: 1.980028E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.917 | TFLOPs: 57.75 | 7: iteration 26340/ 44073 | consumed samples: 13486080 | consumed tokens: 27619491840 | elapsed time per iteration (s): 4.14 | learning rate: 8.392E-05 | global batch size: 512 | lm loss: 1.985288E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.739 | TFLOPs: 57.67 | 7: 
iteration 26350/ 44073 | consumed samples: 13491200 | consumed tokens: 27629977600 | elapsed time per iteration (s): 4.13 | learning rate: 8.386E-05 | global batch size: 512 | lm loss: 1.980812E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.867 | TFLOPs: 57.73 | 7: iteration 26360/ 44073 | consumed samples: 13496320 | consumed tokens: 27640463360 | elapsed time per iteration (s): 4.13 | learning rate: 8.380E-05 | global batch size: 512 | lm loss: 1.993415E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.844 | TFLOPs: 57.72 | 7: iteration 26370/ 44073 | consumed samples: 13501440 | consumed tokens: 27650949120 | elapsed time per iteration (s): 4.14 | learning rate: 8.374E-05 | global batch size: 512 | lm loss: 1.986956E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.769 | TFLOPs: 57.68 | 7: iteration 26380/ 44073 | consumed samples: 13506560 | consumed tokens: 27661434880 | elapsed time per iteration (s): 4.17 | learning rate: 8.367E-05 | global batch size: 512 | lm loss: 2.001777E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.716 | TFLOPs: 57.19 | 7: iteration 26390/ 44073 | consumed samples: 13511680 | consumed tokens: 27671920640 | elapsed time per iteration (s): 4.17 | learning rate: 8.361E-05 | global batch size: 512 | lm loss: 1.994871E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.733 | TFLOPs: 57.20 | 7: iteration 26400/ 44073 | consumed samples: 13516800 | consumed tokens: 27682406400 | elapsed time per iteration (s): 4.13 | learning rate: 8.355E-05 | global batch size: 512 | lm loss: 1.961502E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.907 | TFLOPs: 57.75 | 7: iteration 26410/ 44073 | consumed samples: 13521920 | consumed tokens: 27692892160 | elapsed time per iteration (s): 4.15 | learning rate: 8.349E-05 | global batch size: 512 | lm loss: 1.985903E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.271 | TFLOPs: 57.45 | 7: iteration 26420/ 44073 | consumed samples: 13527040 | consumed tokens: 27703377920 | elapsed time per iteration (s): 4.28 | learning rate: 8.343E-05 | global batch size: 512 | lm loss: 1.979182E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.496 | TFLOPs: 55.69 | 7: iteration 26430/ 44073 | consumed samples: 13532160 | consumed tokens: 27713863680 | elapsed time per iteration (s): 4.13 | learning rate: 8.336E-05 | global batch size: 512 | lm loss: 1.978018E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.907 | TFLOPs: 57.75 | 7: iteration 26440/ 44073 | consumed samples: 13537280 | consumed tokens: 27724349440 | elapsed time per iteration (s): 4.13 | learning rate: 8.330E-05 | global batch size: 512 | lm loss: 2.000140E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.911 | TFLOPs: 57.75 | 7: iteration 26450/ 44073 | consumed samples: 13542400 | consumed tokens: 27734835200 | elapsed time per iteration (s): 4.14 | learning rate: 8.324E-05 | global batch size: 512 | lm loss: 1.985242E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.807 | TFLOPs: 57.70 | 7: iteration 26460/ 44073 | consumed samples: 13547520 | consumed tokens: 27745320960 | elapsed time per iteration (s): 4.14 | learning rate: 8.318E-05 | global 
batch size: 512 | lm loss: 1.960979E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.764 | TFLOPs: 57.68 | 7: iteration 26470/ 44073 | consumed samples: 13552640 | consumed tokens: 27755806720 | elapsed time per iteration (s): 4.15 | learning rate: 8.312E-05 | global batch size: 512 | lm loss: 1.971981E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.519 | TFLOPs: 57.57 | 7: iteration 26480/ 44073 | consumed samples: 13557760 | consumed tokens: 27766292480 | elapsed time per iteration (s): 4.13 | learning rate: 8.306E-05 | global batch size: 512 | lm loss: 1.957642E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.874 | TFLOPs: 57.73 | 7: iteration 26490/ 44073 | consumed samples: 13562880 | consumed tokens: 27776778240 | elapsed time per iteration (s): 4.30 | learning rate: 8.299E-05 | global batch size: 512 | lm loss: 1.990583E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.163 | TFLOPs: 55.54 | 7: iteration 26500/ 44073 | consumed samples: 13568000 | consumed tokens: 27787264000 | elapsed time per iteration (s): 4.15 | learning rate: 8.293E-05 | global batch size: 512 | lm loss: 1.978408E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.491 | TFLOPs: 57.55 | 7: iteration 26510/ 44073 | consumed samples: 13573120 | consumed tokens: 27797749760 | elapsed time per iteration (s): 4.13 | learning rate: 8.287E-05 | global batch size: 512 | lm loss: 1.991100E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.862 | TFLOPs: 57.73 | 7: iteration 26520/ 44073 | consumed samples: 13578240 | consumed 
tokens: 27808235520 | elapsed time per iteration (s): 4.18 | learning rate: 8.281E-05 | global batch size: 512 | lm loss: 1.998780E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.514 | TFLOPs: 57.10 | 7: iteration 26530/ 44073 | consumed samples: 13583360 | consumed tokens: 27818721280 | elapsed time per iteration (s): 4.16 | learning rate: 8.275E-05 | global batch size: 512 | lm loss: 1.983384E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.071 | TFLOPs: 57.36 | 7: iteration 26540/ 44073 | consumed samples: 13588480 | consumed tokens: 27829207040 | elapsed time per iteration (s): 4.17 | learning rate: 8.268E-05 | global batch size: 512 | lm loss: 1.985315E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.723 | TFLOPs: 57.20 | 7: iteration 26550/ 44073 | consumed samples: 13593600 | consumed tokens: 27839692800 | elapsed time per iteration (s): 4.30 | learning rate: 8.262E-05 | global batch size: 512 | lm loss: 1.993948E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.045 | TFLOPs: 55.48 | 7: iteration 26560/ 44073 | consumed samples: 13598720 | consumed tokens: 27850178560 | elapsed time per iteration (s): 4.15 | learning rate: 8.256E-05 | global batch size: 512 | lm loss: 1.985678E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.497 | TFLOPs: 57.56 | 7: iteration 26570/ 44073 | consumed samples: 13603840 | consumed tokens: 27860664320 | elapsed time per iteration (s): 4.16 | learning rate: 8.250E-05 | global batch size: 512 | lm loss: 1.990561E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.997 | TFLOPs: 57.32 | 7: iteration 26580/ 44073 | consumed samples: 13608960 | consumed tokens: 27871150080 | elapsed time per iteration (s): 4.14 | learning rate: 8.244E-05 | global batch size: 512 | lm loss: 1.985769E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.743 | TFLOPs: 57.67 | 7: iteration 26590/ 44073 | consumed samples: 13614080 | consumed tokens: 27881635840 | elapsed time per iteration (s): 4.18 | learning rate: 8.238E-05 | global batch size: 512 | lm loss: 1.988239E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.395 | TFLOPs: 57.04 | 7: iteration 26600/ 44073 | consumed samples: 13619200 | consumed tokens: 27892121600 | elapsed time per iteration (s): 4.13 | learning rate: 8.231E-05 | global batch size: 512 | lm loss: 2.002833E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.945 | TFLOPs: 57.76 | 7: iteration 26610/ 44073 | consumed samples: 13624320 | consumed tokens: 27902607360 | elapsed time per iteration (s): 4.15 | learning rate: 8.225E-05 | global batch size: 512 | lm loss: 1.969679E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.473 | TFLOPs: 57.54 | 7: iteration 26620/ 44073 | consumed samples: 13629440 | consumed tokens: 27913093120 | elapsed time per iteration (s): 4.15 | learning rate: 8.219E-05 | global batch size: 512 | lm loss: 1.988087E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.512 | TFLOPs: 57.56 | 7: iteration 26630/ 44073 | consumed samples: 13634560 | consumed tokens: 27923578880 | elapsed time per iteration (s): 4.13 | learning rate: 8.213E-05 | global batch size: 512 | lm loss: 1.973970E+00 | grad norm: 0.122 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.941 | TFLOPs: 57.76 | 7: iteration 26640/ 44073 | consumed samples: 13639680 | consumed tokens: 27934064640 | elapsed time per iteration (s): 4.19 | learning rate: 8.207E-05 | global batch size: 512 | lm loss: 1.986964E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.278 | TFLOPs: 56.99 | 7: iteration 26650/ 44073 | consumed samples: 13644800 | consumed tokens: 27944550400 | elapsed time per iteration (s): 4.14 | learning rate: 8.201E-05 | global batch size: 512 | lm loss: 1.984981E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.528 | TFLOPs: 57.57 | 7: iteration 26660/ 44073 | consumed samples: 13649920 | consumed tokens: 27955036160 | elapsed time per iteration (s): 4.19 | learning rate: 8.194E-05 | global batch size: 512 | lm loss: 1.978181E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.303 | TFLOPs: 57.00 | 7: iteration 26670/ 44073 | consumed samples: 13655040 | consumed tokens: 27965521920 | elapsed time per iteration (s): 4.18 | learning rate: 8.188E-05 | global batch size: 512 | lm loss: 2.013022E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.569 | TFLOPs: 57.12 | 7: iteration 26680/ 44073 | consumed samples: 13660160 | consumed tokens: 27976007680 | elapsed time per iteration (s): 4.20 | learning rate: 8.182E-05 | global batch size: 512 | lm loss: 1.982003E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.804 | TFLOPs: 56.77 | 7: iteration 26690/ 44073 | consumed samples: 13665280 | consumed tokens: 27986493440 | elapsed time per iteration (s): 4.15 
| learning rate: 8.176E-05 | global batch size: 512 | lm loss: 1.996584E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.454 | TFLOPs: 57.54 | 7: iteration 26700/ 44073 | consumed samples: 13670400 | consumed tokens: 27996979200 | elapsed time per iteration (s): 4.14 | learning rate: 8.170E-05 | global batch size: 512 | lm loss: 1.971472E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.602 | TFLOPs: 57.60 | 7: iteration 26710/ 44073 | consumed samples: 13675520 | consumed tokens: 28007464960 | elapsed time per iteration (s): 4.17 | learning rate: 8.164E-05 | global batch size: 512 | lm loss: 1.998529E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.707 | TFLOPs: 57.19 | 7: iteration 26720/ 44073 | consumed samples: 13680640 | consumed tokens: 28017950720 | elapsed time per iteration (s): 4.14 | learning rate: 8.158E-05 | global batch size: 512 | lm loss: 1.981174E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.644 | TFLOPs: 57.62 | 7: iteration 26730/ 44073 | consumed samples: 13685760 | consumed tokens: 28028436480 | elapsed time per iteration (s): 4.14 | learning rate: 8.151E-05 | global batch size: 512 | lm loss: 1.978761E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 26740/ 44073 | consumed samples: 13690880 | consumed tokens: 28038922240 | elapsed time per iteration (s): 4.14 | learning rate: 8.145E-05 | global batch size: 512 | lm loss: 1.988525E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.639 | TFLOPs: 57.62 | 7: iteration 26750/ 44073 | 
consumed samples: 13696000 | consumed tokens: 28049408000 | elapsed time per iteration (s): 4.13 | learning rate: 8.139E-05 | global batch size: 512 | lm loss: 1.987538E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.827 | TFLOPs: 57.71 | 7: iteration 26760/ 44073 | consumed samples: 13701120 | consumed tokens: 28059893760 | elapsed time per iteration (s): 4.13 | learning rate: 8.133E-05 | global batch size: 512 | lm loss: 1.987526E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.920 | TFLOPs: 57.75 | 7: iteration 26770/ 44073 | consumed samples: 13706240 | consumed tokens: 28070379520 | elapsed time per iteration (s): 4.14 | learning rate: 8.127E-05 | global batch size: 512 | lm loss: 2.004974E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.632 | TFLOPs: 57.62 | 7: iteration 26780/ 44073 | consumed samples: 13711360 | consumed tokens: 28080865280 | elapsed time per iteration (s): 4.16 | learning rate: 8.121E-05 | global batch size: 512 | lm loss: 1.986357E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.144 | TFLOPs: 57.39 | 7: iteration 26790/ 44073 | consumed samples: 13716480 | consumed tokens: 28091351040 | elapsed time per iteration (s): 4.21 | learning rate: 8.115E-05 | global batch size: 512 | lm loss: 1.973225E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.603 | TFLOPs: 56.67 | 7: iteration 26800/ 44073 | consumed samples: 13721600 | consumed tokens: 28101836800 | elapsed time per iteration (s): 4.13 | learning rate: 8.108E-05 | global batch size: 512 | lm loss: 1.975000E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.901 | TFLOPs: 57.74 | 7: iteration 26810/ 44073 | consumed samples: 13726720 | consumed tokens: 28112322560 | elapsed time per iteration (s): 4.14 | learning rate: 8.102E-05 | global batch size: 512 | lm loss: 1.980805E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 26820/ 44073 | consumed samples: 13731840 | consumed tokens: 28122808320 | elapsed time per iteration (s): 4.14 | learning rate: 8.096E-05 | global batch size: 512 | lm loss: 1.974225E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.631 | TFLOPs: 57.62 | 7: iteration 26830/ 44073 | consumed samples: 13736960 | consumed tokens: 28133294080 | elapsed time per iteration (s): 4.18 | learning rate: 8.090E-05 | global batch size: 512 | lm loss: 1.998885E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.589 | TFLOPs: 57.13 | 7: iteration 26840/ 44073 | consumed samples: 13742080 | consumed tokens: 28143779840 | elapsed time per iteration (s): 4.16 | learning rate: 8.084E-05 | global batch size: 512 | lm loss: 1.992799E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.127 | TFLOPs: 57.38 | 7: iteration 26850/ 44073 | consumed samples: 13747200 | consumed tokens: 28154265600 | elapsed time per iteration (s): 4.16 | learning rate: 8.078E-05 | global batch size: 512 | lm loss: 2.007664E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.223 | TFLOPs: 57.43 | 7: iteration 26860/ 44073 | consumed samples: 13752320 | consumed tokens: 28164751360 | elapsed time per iteration (s): 4.16 | learning rate: 8.072E-05 | global batch size: 512 | lm loss: 
1.997292E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.018 | TFLOPs: 57.33 | 7: iteration 26870/ 44073 | consumed samples: 13757440 | consumed tokens: 28175237120 | elapsed time per iteration (s): 4.14 | learning rate: 8.066E-05 | global batch size: 512 | lm loss: 1.990266E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.744 | TFLOPs: 57.67 | 7: iteration 26880/ 44073 | consumed samples: 13762560 | consumed tokens: 28185722880 | elapsed time per iteration (s): 4.15 | learning rate: 8.059E-05 | global batch size: 512 | lm loss: 1.977253E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.263 | TFLOPs: 57.45 | 7: iteration 26890/ 44073 | consumed samples: 13767680 | consumed tokens: 28196208640 | elapsed time per iteration (s): 4.17 | learning rate: 8.053E-05 | global batch size: 512 | lm loss: 1.968460E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.701 | TFLOPs: 57.18 | 7: iteration 26900/ 44073 | consumed samples: 13772800 | consumed tokens: 28206694400 | elapsed time per iteration (s): 4.16 | learning rate: 8.047E-05 | global batch size: 512 | lm loss: 1.971027E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.195 | TFLOPs: 57.41 | 7: iteration 26910/ 44073 | consumed samples: 13777920 | consumed tokens: 28217180160 | elapsed time per iteration (s): 4.14 | learning rate: 8.041E-05 | global batch size: 512 | lm loss: 1.980107E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.801 | TFLOPs: 57.70 | 7: iteration 26920/ 44073 | consumed samples: 13783040 | consumed tokens: 28227665920 | 
elapsed time per iteration (s): 4.15 | learning rate: 8.035E-05 | global batch size: 512 | lm loss: 1.976189E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.515 | TFLOPs: 57.56 | 7: iteration 26930/ 44073 | consumed samples: 13788160 | consumed tokens: 28238151680 | elapsed time per iteration (s): 4.14 | learning rate: 8.029E-05 | global batch size: 512 | lm loss: 1.965658E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.714 | TFLOPs: 57.66 | 7: iteration 26940/ 44073 | consumed samples: 13793280 | consumed tokens: 28248637440 | elapsed time per iteration (s): 4.16 | learning rate: 8.023E-05 | global batch size: 512 | lm loss: 1.965010E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.050 | TFLOPs: 57.35 | 7: iteration 26950/ 44073 | consumed samples: 13798400 | consumed tokens: 28259123200 | elapsed time per iteration (s): 4.18 | learning rate: 8.017E-05 | global batch size: 512 | lm loss: 1.985451E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.482 | TFLOPs: 57.08 | 7: iteration 26960/ 44073 | consumed samples: 13803520 | consumed tokens: 28269608960 | elapsed time per iteration (s): 4.13 | learning rate: 8.010E-05 | global batch size: 512 | lm loss: 1.979415E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.868 | TFLOPs: 57.73 | 7: iteration 26970/ 44073 | consumed samples: 13808640 | consumed tokens: 28280094720 | elapsed time per iteration (s): 4.13 | learning rate: 8.004E-05 | global batch size: 512 | lm loss: 1.969100E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.919 | TFLOPs: 
57.75 | 7: iteration 26980/ 44073 | consumed samples: 13813760 | consumed tokens: 28290580480 | elapsed time per iteration (s): 4.29 | learning rate: 7.998E-05 | global batch size: 512 | lm loss: 1.972381E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.267 | TFLOPs: 55.58 | 7: iteration 26990/ 44073 | consumed samples: 13818880 | consumed tokens: 28301066240 | elapsed time per iteration (s): 4.13 | learning rate: 7.992E-05 | global batch size: 512 | lm loss: 1.994390E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.891 | TFLOPs: 57.74 | 7: iteration 27000/ 44073 | consumed samples: 13824000 | consumed tokens: 28311552000 | elapsed time per iteration (s): 4.32 | learning rate: 7.986E-05 | global batch size: 512 | lm loss: 1.985397E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.480 | TFLOPs: 55.22 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 27000 | lm loss value: 1.983988E+00 | lm loss PPL: 7.271687E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 27000 to checkpoints_2b2 0: [2022-11-26 18:04:33,890] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step27000 is begin to save! 0: [2022-11-26 18:04:33,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:04:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:04:34,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 18:04:34,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:04:34,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:04:34,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:04:34,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:04:34,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:04:34,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:04:34,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:04:34,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_07-model_00-model_states.pt... 0: [2022-11-26 18:04:34,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:04:34,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:04:35,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:04:35,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 18:04:35,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:04:35,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:04:35,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:04:35,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:04:35,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:04:35,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:04:35,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:04:35,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_13-model_00-model_states.pt... 0: [2022-11-26 18:04:35,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:04:35,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:04:35,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:04:35,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 18:04:36,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:04:36,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:04:36,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:04:36,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:04:36,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:04:36,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:04:36,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:04:36,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_19-model_00-model_states.pt... 0: [2022-11-26 18:04:36,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:04:36,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:04:36,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:04:36,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 18:04:36,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:04:36,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:04:37,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:04:37,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:04:37,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:04:37,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:04:37,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:04:37,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_25-model_00-model_states.pt... 0: [2022-11-26 18:04:37,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:04:37,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:04:37,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:04:37,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 18:04:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:04:37,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:04:37,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:04:37,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_29-model_00-model_states.pt... 0: [2022-11-26 18:04:38,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_29-model_00-model_states.pt. 0: [2022-11-26 18:04:38,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:04:38,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:04:38,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_31-model_00-model_states.pt... 0: [2022-11-26 18:04:38,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_31-model_00-model_states.pt. 0: [2022-11-26 18:04:38,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_32-model_00-model_states.pt... 0: [2022-11-26 18:04:38,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_32-model_00-model_states.pt. 0: [2022-11-26 18:04:38,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 18:04:38,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_33-model_00-model_states.pt. 0: [2022-11-26 18:04:38,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_34-model_00-model_states.pt... 0: [2022-11-26 18:04:38,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_34-model_00-model_states.pt. 0: [2022-11-26 18:04:38,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/layer_36-model_00-model_states.pt... 0: [2022-11-26 18:04:38,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/layer_36-model_00-model_states.pt. 0: [2022-11-26 18:04:38,765] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step27000/mp_rank_00_model_states.pt 0: [2022-11-26 18:04:38,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:04:38,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/mp_rank_00_model_states.pt. 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:04:38,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:04:39,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 18:04:39,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
0: [2022-11-26 18:04:39,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 18:04:39,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 18:04:39,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 18:04:39,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
0: [2022-11-26 18:04:39,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:04:39,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
3: [2022-11-26 18:04:39,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 18:04:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
3: [2022-11-26 18:04:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:04:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:04:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 18:04:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:04:39,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 18:04:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:04:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:04:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 18:04:39,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 18:04:39,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:04:39,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:04:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:04:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:04:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:04:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 18:04:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:04:40,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
1: [2022-11-26 18:04:40,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 18:04:40,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:04:40,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:04:40,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: successfully saved checkpoint at iteration 27000 to checkpoints_2b2 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 18:04:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
7: time (ms) | save-checkpoint: 6246.43 7: iteration 27010/ 44073 | consumed samples: 13829120 | consumed tokens: 28322037760 | elapsed time per iteration (s): 4.91 | learning rate: 7.980E-05 | global batch size: 512 | lm loss: 1.959452E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.315 | TFLOPs: 48.62 | 7: iteration 27020/ 44073 | consumed samples: 13834240 | consumed tokens: 28332523520 | elapsed time per iteration (s): 4.17 | learning rate: 7.974E-05 | global batch size: 512 | lm loss: 1.966917E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.653 | TFLOPs: 57.16 | 7: iteration 27030/ 44073 | consumed samples: 13839360 | consumed tokens: 28343009280 | elapsed time per iteration (s): 4.17 | learning rate: 7.968E-05 | global batch size: 512 | lm loss: 1.968919E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.826 | TFLOPs: 57.24 | 7: iteration 27040/ 44073 | consumed samples: 13844480 | consumed tokens: 28353495040 | elapsed time per iteration (s): 4.15 | learning rate: 7.962E-05 | global batch size: 512 | lm loss: 1.967478E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.503 | TFLOPs: 57.56 | 7: iteration 27050/ 44073 | consumed samples: 13849600 | consumed tokens: 28363980800 | elapsed time per iteration (s): 4.15 | learning rate: 7.956E-05 | global batch size: 512 | lm loss: 1.998078E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.232 | TFLOPs: 57.43 | 7: iteration 27060/ 44073 | consumed samples: 13854720 | consumed tokens: 28374466560 | elapsed time per iteration (s): 4.18 | learning rate: 7.949E-05 | global batch size: 512 | lm loss: 1.964843E+00 | grad norm: 
0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.432 | TFLOPs: 57.06 | 7: iteration 27070/ 44073 | consumed samples: 13859840 | consumed tokens: 28384952320 | elapsed time per iteration (s): 4.13 | learning rate: 7.943E-05 | global batch size: 512 | lm loss: 1.996779E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.863 | TFLOPs: 57.73 | 7: iteration 27080/ 44073 | consumed samples: 13864960 | consumed tokens: 28395438080 | elapsed time per iteration (s): 4.14 | learning rate: 7.937E-05 | global batch size: 512 | lm loss: 1.977963E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.540 | TFLOPs: 57.58 | 7: iteration 27090/ 44073 | consumed samples: 13870080 | consumed tokens: 28405923840 | elapsed time per iteration (s): 4.14 | learning rate: 7.931E-05 | global batch size: 512 | lm loss: 1.979209E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.658 | TFLOPs: 57.63 | 7: iteration 27100/ 44073 | consumed samples: 13875200 | consumed tokens: 28416409600 | elapsed time per iteration (s): 4.18 | learning rate: 7.925E-05 | global batch size: 512 | lm loss: 1.986141E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.375 | TFLOPs: 57.03 | 7: iteration 27110/ 44073 | consumed samples: 13880320 | consumed tokens: 28426895360 | elapsed time per iteration (s): 4.15 | learning rate: 7.919E-05 | global batch size: 512 | lm loss: 1.967361E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.455 | TFLOPs: 57.54 | 7: iteration 27120/ 44073 | consumed samples: 13885440 | consumed tokens: 28437381120 | elapsed time per iteration (s): 
4.16 | learning rate: 7.913E-05 | global batch size: 512 | lm loss: 1.965239E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.931 | TFLOPs: 57.29 | 7: iteration 27130/ 44073 | consumed samples: 13890560 | consumed tokens: 28447866880 | elapsed time per iteration (s): 4.14 | learning rate: 7.907E-05 | global batch size: 512 | lm loss: 1.996633E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.593 | TFLOPs: 57.60 | 7: iteration 27140/ 44073 | consumed samples: 13895680 | consumed tokens: 28458352640 | elapsed time per iteration (s): 4.16 | learning rate: 7.901E-05 | global batch size: 512 | lm loss: 2.008146E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.132 | TFLOPs: 57.39 | 7: iteration 27150/ 44073 | consumed samples: 13900800 | consumed tokens: 28468838400 | elapsed time per iteration (s): 4.19 | learning rate: 7.895E-05 | global batch size: 512 | lm loss: 2.000170E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.260 | TFLOPs: 56.98 | 7: iteration 27160/ 44073 | consumed samples: 13905920 | consumed tokens: 28479324160 | elapsed time per iteration (s): 4.17 | learning rate: 7.889E-05 | global batch size: 512 | lm loss: 1.976992E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.659 | TFLOPs: 57.17 | 7: iteration 27170/ 44073 | consumed samples: 13911040 | consumed tokens: 28489809920 | elapsed time per iteration (s): 4.16 | learning rate: 7.882E-05 | global batch size: 512 | lm loss: 1.972664E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.107 | TFLOPs: 57.37 | 7: iteration 27180/ 44073 
| consumed samples: 13916160 | consumed tokens: 28500295680 | elapsed time per iteration (s): 4.19 | learning rate: 7.876E-05 | global batch size: 512 | lm loss: 2.000785E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.097 | TFLOPs: 56.90 | 7: iteration 27190/ 44073 | consumed samples: 13921280 | consumed tokens: 28510781440 | elapsed time per iteration (s): 4.15 | learning rate: 7.870E-05 | global batch size: 512 | lm loss: 1.975515E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.457 | TFLOPs: 57.54 | 7: iteration 27200/ 44073 | consumed samples: 13926400 | consumed tokens: 28521267200 | elapsed time per iteration (s): 4.16 | learning rate: 7.864E-05 | global batch size: 512 | lm loss: 1.985011E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.028 | TFLOPs: 57.34 | 7: iteration 27210/ 44073 | consumed samples: 13931520 | consumed tokens: 28531752960 | elapsed time per iteration (s): 4.16 | learning rate: 7.858E-05 | global batch size: 512 | lm loss: 1.953408E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.104 | TFLOPs: 57.37 | 7: iteration 27220/ 44073 | consumed samples: 13936640 | consumed tokens: 28542238720 | elapsed time per iteration (s): 4.18 | learning rate: 7.852E-05 | global batch size: 512 | lm loss: 1.989783E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.498 | TFLOPs: 57.09 | 7: iteration 27230/ 44073 | consumed samples: 13941760 | consumed tokens: 28552724480 | elapsed time per iteration (s): 4.20 | learning rate: 7.846E-05 | global batch size: 512 | lm loss: 1.986398E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 121.905 | TFLOPs: 56.81 | 7: iteration 27240/ 44073 | consumed samples: 13946880 | consumed tokens: 28563210240 | elapsed time per iteration (s): 4.17 | learning rate: 7.840E-05 | global batch size: 512 | lm loss: 1.984715E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.763 | TFLOPs: 57.21 | 7: iteration 27250/ 44073 | consumed samples: 13952000 | consumed tokens: 28573696000 | elapsed time per iteration (s): 4.18 | learning rate: 7.834E-05 | global batch size: 512 | lm loss: 1.977288E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.623 | TFLOPs: 57.15 | 7: iteration 27260/ 44073 | consumed samples: 13957120 | consumed tokens: 28584181760 | elapsed time per iteration (s): 4.16 | learning rate: 7.828E-05 | global batch size: 512 | lm loss: 1.977786E+00 | grad norm: 0.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.955 | TFLOPs: 57.30 | 7: iteration 27270/ 44073 | consumed samples: 13962240 | consumed tokens: 28594667520 | elapsed time per iteration (s): 4.17 | learning rate: 7.822E-05 | global batch size: 512 | lm loss: 1.968068E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.884 | TFLOPs: 57.27 | 7: iteration 27280/ 44073 | consumed samples: 13967360 | consumed tokens: 28605153280 | elapsed time per iteration (s): 4.17 | learning rate: 7.816E-05 | global batch size: 512 | lm loss: 1.974557E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.687 | TFLOPs: 57.18 | 7: iteration 27290/ 44073 | consumed samples: 13972480 | consumed tokens: 28615639040 | elapsed time per iteration (s): 4.16 | learning rate: 7.810E-05 | global batch size: 512 | lm 
loss: 1.961656E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.108 | TFLOPs: 57.37 | 7: iteration 27300/ 44073 | consumed samples: 13977600 | consumed tokens: 28626124800 | elapsed time per iteration (s): 4.22 | learning rate: 7.804E-05 | global batch size: 512 | lm loss: 1.993303E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.313 | TFLOPs: 56.54 | 7: iteration 27310/ 44073 | consumed samples: 13982720 | consumed tokens: 28636610560 | elapsed time per iteration (s): 4.16 | learning rate: 7.797E-05 | global batch size: 512 | lm loss: 1.970577E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.973 | TFLOPs: 57.31 | 7: iteration 27320/ 44073 | consumed samples: 13987840 | consumed tokens: 28647096320 | elapsed time per iteration (s): 4.16 | learning rate: 7.791E-05 | global batch size: 512 | lm loss: 1.980907E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.036 | TFLOPs: 57.34 | 7: iteration 27330/ 44073 | consumed samples: 13992960 | consumed tokens: 28657582080 | elapsed time per iteration (s): 4.17 | learning rate: 7.785E-05 | global batch size: 512 | lm loss: 1.952723E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.854 | TFLOPs: 57.26 | 7: iteration 27340/ 44073 | consumed samples: 13998080 | consumed tokens: 28668067840 | elapsed time per iteration (s): 4.15 | learning rate: 7.779E-05 | global batch size: 512 | lm loss: 1.973365E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.252 | TFLOPs: 57.44 | 7: iteration 27350/ 44073 | consumed samples: 14003200 | consumed tokens: 28678553600 | 
elapsed time per iteration (s): 4.14 | learning rate: 7.773E-05 | global batch size: 512 | lm loss: 1.988824E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.636 | TFLOPs: 57.62 | 7: iteration 27360/ 44073 | consumed samples: 14008320 | consumed tokens: 28689039360 | elapsed time per iteration (s): 4.41 | learning rate: 7.767E-05 | global batch size: 512 | lm loss: 1.976867E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 116.088 | TFLOPs: 54.10 | 7: iteration 27370/ 44073 | consumed samples: 14013440 | consumed tokens: 28699525120 | elapsed time per iteration (s): 4.19 | learning rate: 7.761E-05 | global batch size: 512 | lm loss: 1.987526E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.172 | TFLOPs: 56.94 | 7: iteration 27380/ 44073 | consumed samples: 14018560 | consumed tokens: 28710010880 | elapsed time per iteration (s): 4.17 | learning rate: 7.755E-05 | global batch size: 512 | lm loss: 1.963314E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.646 | TFLOPs: 57.16 | 7: iteration 27390/ 44073 | consumed samples: 14023680 | consumed tokens: 28720496640 | elapsed time per iteration (s): 4.19 | learning rate: 7.749E-05 | global batch size: 512 | lm loss: 1.996653E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.316 | TFLOPs: 57.01 | 7: iteration 27400/ 44073 | consumed samples: 14028800 | consumed tokens: 28730982400 | elapsed time per iteration (s): 4.18 | learning rate: 7.743E-05 | global batch size: 512 | lm loss: 1.979793E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.607 | TFLOPs: 
57.14 | 7: iteration 27410/ 44073 | consumed samples: 14033920 | consumed tokens: 28741468160 | elapsed time per iteration (s): 4.21 | learning rate: 7.737E-05 | global batch size: 512 | lm loss: 1.976561E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.602 | TFLOPs: 56.67 | 7: iteration 27420/ 44073 | consumed samples: 14039040 | consumed tokens: 28751953920 | elapsed time per iteration (s): 4.16 | learning rate: 7.731E-05 | global batch size: 512 | lm loss: 1.978463E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.057 | TFLOPs: 57.35 | 7: iteration 27430/ 44073 | consumed samples: 14044160 | consumed tokens: 28762439680 | elapsed time per iteration (s): 4.18 | learning rate: 7.725E-05 | global batch size: 512 | lm loss: 1.974670E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.509 | TFLOPs: 57.10 | 7: iteration 27440/ 44073 | consumed samples: 14049280 | consumed tokens: 28772925440 | elapsed time per iteration (s): 4.18 | learning rate: 7.719E-05 | global batch size: 512 | lm loss: 1.971966E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.539 | TFLOPs: 57.11 | 7: iteration 27450/ 44073 | consumed samples: 14054400 | consumed tokens: 28783411200 | elapsed time per iteration (s): 4.15 | learning rate: 7.713E-05 | global batch size: 512 | lm loss: 1.965007E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.511 | TFLOPs: 57.56 | 7: iteration 27460/ 44073 | consumed samples: 14059520 | consumed tokens: 28793896960 | elapsed time per iteration (s): 4.17 | learning rate: 7.707E-05 | global batch size: 512 | lm loss: 1.990468E+00 | grad norm: 0.129 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.723 | TFLOPs: 57.20 | 7: iteration 27470/ 44073 | consumed samples: 14064640 | consumed tokens: 28804382720 | elapsed time per iteration (s): 4.16 | learning rate: 7.701E-05 | global batch size: 512 | lm loss: 1.982566E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.171 | TFLOPs: 57.40 | 7: iteration 27480/ 44073 | consumed samples: 14069760 | consumed tokens: 28814868480 | elapsed time per iteration (s): 4.30 | learning rate: 7.695E-05 | global batch size: 512 | lm loss: 1.988672E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.975 | TFLOPs: 55.45 | 7: iteration 27490/ 44073 | consumed samples: 14074880 | consumed tokens: 28825354240 | elapsed time per iteration (s): 4.13 | learning rate: 7.689E-05 | global batch size: 512 | lm loss: 1.984309E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.891 | TFLOPs: 57.74 | 7: iteration 27500/ 44073 | consumed samples: 14080000 | consumed tokens: 28835840000 | elapsed time per iteration (s): 4.18 | learning rate: 7.683E-05 | global batch size: 512 | lm loss: 1.978556E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.508 | TFLOPs: 57.10 | 7: iteration 27510/ 44073 | consumed samples: 14085120 | consumed tokens: 28846325760 | elapsed time per iteration (s): 4.14 | learning rate: 7.677E-05 | global batch size: 512 | lm loss: 1.965413E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.690 | TFLOPs: 57.65 | 7: iteration 27520/ 44073 | consumed samples: 14090240 | consumed tokens: 28856811520 | elapsed time per iteration (s): 4.14 | learning rate: 7.671E-05 
| global batch size: 512 | lm loss: 1.976583E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.734 | TFLOPs: 57.67 | 7: iteration 27530/ 44073 | consumed samples: 14095360 | consumed tokens: 28867297280 | elapsed time per iteration (s): 4.13 | learning rate: 7.665E-05 | global batch size: 512 | lm loss: 1.977283E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.878 | TFLOPs: 57.73 | 7: iteration 27540/ 44073 | consumed samples: 14100480 | consumed tokens: 28877783040 | elapsed time per iteration (s): 4.14 | learning rate: 7.659E-05 | global batch size: 512 | lm loss: 1.983666E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.629 | TFLOPs: 57.62 | 7: iteration 27550/ 44073 | consumed samples: 14105600 | consumed tokens: 28888268800 | elapsed time per iteration (s): 4.16 | learning rate: 7.653E-05 | global batch size: 512 | lm loss: 1.993562E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.966 | TFLOPs: 57.31 | 7: iteration 27560/ 44073 | consumed samples: 14110720 | consumed tokens: 28898754560 | elapsed time per iteration (s): 4.16 | learning rate: 7.647E-05 | global batch size: 512 | lm loss: 1.993641E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 | 7: iteration 27570/ 44073 | consumed samples: 14115840 | consumed tokens: 28909240320 | elapsed time per iteration (s): 4.14 | learning rate: 7.641E-05 | global batch size: 512 | lm loss: 1.959390E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.811 | TFLOPs: 57.70 | 7: iteration 27580/ 44073 | consumed samples: 14120960 | 
consumed tokens: 28919726080 | elapsed time per iteration (s): 4.15 | learning rate: 7.635E-05 | global batch size: 512 | lm loss: 1.976715E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.242 | TFLOPs: 57.44 | 7: iteration 27590/ 44073 | consumed samples: 14126080 | consumed tokens: 28930211840 | elapsed time per iteration (s): 4.15 | learning rate: 7.629E-05 | global batch size: 512 | lm loss: 1.986285E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.520 | TFLOPs: 57.57 | 7: iteration 27600/ 44073 | consumed samples: 14131200 | consumed tokens: 28940697600 | elapsed time per iteration (s): 4.13 | learning rate: 7.623E-05 | global batch size: 512 | lm loss: 1.992300E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.877 | TFLOPs: 57.73 | 7: iteration 27610/ 44073 | consumed samples: 14136320 | consumed tokens: 28951183360 | elapsed time per iteration (s): 4.21 | learning rate: 7.617E-05 | global batch size: 512 | lm loss: 2.000592E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.514 | TFLOPs: 56.63 | 7: iteration 27620/ 44073 | consumed samples: 14141440 | consumed tokens: 28961669120 | elapsed time per iteration (s): 4.17 | learning rate: 7.611E-05 | global batch size: 512 | lm loss: 1.987909E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.860 | TFLOPs: 57.26 | 7: iteration 27630/ 44073 | consumed samples: 14146560 | consumed tokens: 28972154880 | elapsed time per iteration (s): 4.18 | learning rate: 7.605E-05 | global batch size: 512 | lm loss: 1.964473E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 122.588 | TFLOPs: 57.13 | 7: iteration 27640/ 44073 | consumed samples: 14151680 | consumed tokens: 28982640640 | elapsed time per iteration (s): 4.21 | learning rate: 7.599E-05 | global batch size: 512 | lm loss: 1.976179E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.700 | TFLOPs: 56.72 | 7: iteration 27650/ 44073 | consumed samples: 14156800 | consumed tokens: 28993126400 | elapsed time per iteration (s): 4.17 | learning rate: 7.593E-05 | global batch size: 512 | lm loss: 1.994427E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.736 | TFLOPs: 57.20 | 7: iteration 27660/ 44073 | consumed samples: 14161920 | consumed tokens: 29003612160 | elapsed time per iteration (s): 4.18 | learning rate: 7.587E-05 | global batch size: 512 | lm loss: 1.979066E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.634 | TFLOPs: 57.15 | 7: iteration 27670/ 44073 | consumed samples: 14167040 | consumed tokens: 29014097920 | elapsed time per iteration (s): 4.19 | learning rate: 7.581E-05 | global batch size: 512 | lm loss: 1.974698E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.148 | TFLOPs: 56.93 | 7: iteration 27680/ 44073 | consumed samples: 14172160 | consumed tokens: 29024583680 | elapsed time per iteration (s): 4.19 | learning rate: 7.575E-05 | global batch size: 512 | lm loss: 1.964574E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.279 | TFLOPs: 56.99 | 7: iteration 27690/ 44073 | consumed samples: 14177280 | consumed tokens: 29035069440 | elapsed time per iteration (s): 4.17 | learning rate: 7.569E-05 | global batch size: 512 | lm loss: 1.962667E+00 | grad norm: 
0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.654 | TFLOPs: 57.16 | 7: iteration 27700/ 44073 | consumed samples: 14182400 | consumed tokens: 29045555200 | elapsed time per iteration (s): 4.13 | learning rate: 7.563E-05 | global batch size: 512 | lm loss: 1.982389E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.889 | TFLOPs: 57.74 | 7: iteration 27710/ 44073 | consumed samples: 14187520 | consumed tokens: 29056040960 | elapsed time per iteration (s): 4.17 | learning rate: 7.557E-05 | global batch size: 512 | lm loss: 1.982931E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.658 | TFLOPs: 57.17 | 7: iteration 27720/ 44073 | consumed samples: 14192640 | consumed tokens: 29066526720 | elapsed time per iteration (s): 4.19 | learning rate: 7.551E-05 | global batch size: 512 | lm loss: 1.982277E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.081 | TFLOPs: 56.90 | 7: iteration 27730/ 44073 | consumed samples: 14197760 | consumed tokens: 29077012480 | elapsed time per iteration (s): 4.16 | learning rate: 7.545E-05 | global batch size: 512 | lm loss: 1.994114E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.930 | TFLOPs: 57.29 | 7: iteration 27740/ 44073 | consumed samples: 14202880 | consumed tokens: 29087498240 | elapsed time per iteration (s): 4.17 | learning rate: 7.539E-05 | global batch size: 512 | lm loss: 1.982764E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.909 | TFLOPs: 57.28 | 7: iteration 27750/ 44073 | consumed samples: 14208000 | consumed tokens: 29097984000 | elapsed time per iteration (s): 
4.16 | learning rate: 7.533E-05 | global batch size: 512 | lm loss: 1.952847E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.078 | TFLOPs: 57.36 | 7: iteration 27760/ 44073 | consumed samples: 14213120 | consumed tokens: 29108469760 | elapsed time per iteration (s): 4.15 | learning rate: 7.527E-05 | global batch size: 512 | lm loss: 1.983847E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.364 | TFLOPs: 57.49 | 7: iteration 27770/ 44073 | consumed samples: 14218240 | consumed tokens: 29118955520 | elapsed time per iteration (s): 4.18 | learning rate: 7.521E-05 | global batch size: 512 | lm loss: 1.971270E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.570 | TFLOPs: 57.12 | 7: iteration 27780/ 44073 | consumed samples: 14223360 | consumed tokens: 29129441280 | elapsed time per iteration (s): 4.20 | learning rate: 7.515E-05 | global batch size: 512 | lm loss: 1.972226E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.866 | TFLOPs: 56.80 | 7: iteration 27790/ 44073 | consumed samples: 14228480 | consumed tokens: 29139927040 | elapsed time per iteration (s): 4.14 | learning rate: 7.509E-05 | global batch size: 512 | lm loss: 1.966865E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.743 | TFLOPs: 57.67 | 7: iteration 27800/ 44073 | consumed samples: 14233600 | consumed tokens: 29150412800 | elapsed time per iteration (s): 4.19 | learning rate: 7.503E-05 | global batch size: 512 | lm loss: 1.984733E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.175 | TFLOPs: 56.94 | 7: iteration 27810/ 44073 
| consumed samples: 14238720 | consumed tokens: 29160898560 | elapsed time per iteration (s): 4.18 | learning rate: 7.497E-05 | global batch size: 512 | lm loss: 1.994654E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.510 | TFLOPs: 57.10 | 7: iteration 27820/ 44073 | consumed samples: 14243840 | consumed tokens: 29171384320 | elapsed time per iteration (s): 4.17 | learning rate: 7.491E-05 | global batch size: 512 | lm loss: 1.962260E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.722 | TFLOPs: 57.19 | 7: iteration 27830/ 44073 | consumed samples: 14248960 | consumed tokens: 29181870080 | elapsed time per iteration (s): 4.15 | learning rate: 7.485E-05 | global batch size: 512 | lm loss: 1.991804E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.305 | TFLOPs: 57.47 | 7: iteration 27840/ 44073 | consumed samples: 14254080 | consumed tokens: 29192355840 | elapsed time per iteration (s): 4.17 | learning rate: 7.479E-05 | global batch size: 512 | lm loss: 1.973135E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.863 | TFLOPs: 57.26 | 7: iteration 27850/ 44073 | consumed samples: 14259200 | consumed tokens: 29202841600 | elapsed time per iteration (s): 4.14 | learning rate: 7.473E-05 | global batch size: 512 | lm loss: 1.969526E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.600 | TFLOPs: 57.60 | 7: iteration 27860/ 44073 | consumed samples: 14264320 | consumed tokens: 29213327360 | elapsed time per iteration (s): 4.16 | learning rate: 7.467E-05 | global batch size: 512 | lm loss: 1.959130E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 122.932 | TFLOPs: 57.29 | 7: iteration 27870/ 44073 | consumed samples: 14269440 | consumed tokens: 29223813120 | elapsed time per iteration (s): 4.14 | learning rate: 7.461E-05 | global batch size: 512 | lm loss: 1.971637E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.558 | TFLOPs: 57.58 | 7: iteration 27880/ 44073 | consumed samples: 14274560 | consumed tokens: 29234298880 | elapsed time per iteration (s): 4.31 | learning rate: 7.455E-05 | global batch size: 512 | lm loss: 1.986003E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.670 | TFLOPs: 55.31 | 7: iteration 27890/ 44073 | consumed samples: 14279680 | consumed tokens: 29244784640 | elapsed time per iteration (s): 4.16 | learning rate: 7.449E-05 | global batch size: 512 | lm loss: 1.987833E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.108 | TFLOPs: 57.37 | 7: iteration 27900/ 44073 | consumed samples: 14284800 | consumed tokens: 29255270400 | elapsed time per iteration (s): 4.15 | learning rate: 7.443E-05 | global batch size: 512 | lm loss: 1.984713E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.236 | TFLOPs: 57.43 | 7: iteration 27910/ 44073 | consumed samples: 14289920 | consumed tokens: 29265756160 | elapsed time per iteration (s): 4.18 | learning rate: 7.437E-05 | global batch size: 512 | lm loss: 1.984749E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.444 | TFLOPs: 57.06 | 7: iteration 27920/ 44073 | consumed samples: 14295040 | consumed tokens: 29276241920 | elapsed time per iteration (s): 4.13 | learning rate: 7.431E-05 | global batch size: 512 | lm 
loss: 1.983612E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.958 | TFLOPs: 57.77 | 7: iteration 27930/ 44073 | consumed samples: 14300160 | consumed tokens: 29286727680 | elapsed time per iteration (s): 4.14 | learning rate: 7.425E-05 | global batch size: 512 | lm loss: 1.983110E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.603 | TFLOPs: 57.61 | 7: iteration 27940/ 44073 | consumed samples: 14305280 | consumed tokens: 29297213440 | elapsed time per iteration (s): 4.16 | learning rate: 7.419E-05 | global batch size: 512 | lm loss: 1.972046E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.185 | TFLOPs: 57.41 | 7: iteration 27950/ 44073 | consumed samples: 14310400 | consumed tokens: 29307699200 | elapsed time per iteration (s): 4.20 | learning rate: 7.413E-05 | global batch size: 512 | lm loss: 1.993669E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.031 | TFLOPs: 56.87 | 7: iteration 27960/ 44073 | consumed samples: 14315520 | consumed tokens: 29318184960 | elapsed time per iteration (s): 4.15 | learning rate: 7.408E-05 | global batch size: 512 | lm loss: 1.954282E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.496 | TFLOPs: 57.56 | 7: iteration 27970/ 44073 | consumed samples: 14320640 | consumed tokens: 29328670720 | elapsed time per iteration (s): 4.14 | learning rate: 7.402E-05 | global batch size: 512 | lm loss: 1.969758E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.780 | TFLOPs: 57.69 | 7: iteration 27980/ 44073 | consumed samples: 14325760 | consumed tokens: 29339156480 | 
elapsed time per iteration (s): 4.14 | learning rate: 7.396E-05 | global batch size: 512 | lm loss: 1.966420E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.748 | TFLOPs: 57.67 | 7: iteration 27990/ 44073 | consumed samples: 14330880 | consumed tokens: 29349642240 | elapsed time per iteration (s): 4.14 | learning rate: 7.390E-05 | global batch size: 512 | lm loss: 1.987685E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.640 | TFLOPs: 57.62 | 0: [2022-11-26 19:14:09,658] [INFO] [logging.py:68:log_dist] [Rank 0] step=28000, skipped=0, lr=[7.383759989222292e-05, 7.383759989222292e-05, 7.383759989222292e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 28000/ 44073 | consumed samples: 14336000 | consumed tokens: 29360128000 | elapsed time per iteration (s): 4.16 | learning rate: 7.384E-05 | global batch size: 512 | lm loss: 1.966833E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.095 | TFLOPs: 57.37 | 0: steps: 28000 loss: 1.9430 iter time (s): 4.160 samples/sec: 123.065 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 28000 | lm loss value: 1.916011E+00 | lm loss PPL: 6.793807E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 28000 to checkpoints_2b2 0: [2022-11-26 19:14:10,999] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step28000 is begin to save! 0: [2022-11-26 19:14:11,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 19:14:11,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:14:11,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:14:11,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:14:11,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:14:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:14:11,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:14:11,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:14:11,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:14:11,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:14:11,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_07-model_00-model_states.pt... 0: [2022-11-26 19:14:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:14:12,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 19:14:12,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:14:12,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:14:12,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:14:12,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:14:12,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:14:12,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:14:12,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:14:12,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:14:12,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:14:12,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_13-model_00-model_states.pt... 0: [2022-11-26 19:14:12,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:14:12,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 19:14:13,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:14:13,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:14:13,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:14:13,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:14:13,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:14:13,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:14:13,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:14:13,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:14:13,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:14:13,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_19-model_00-model_states.pt... 0: [2022-11-26 19:14:13,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:14:13,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 19:14:13,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:14:13,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:14:13,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:14:13,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:14:14,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:14:14,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:14:14,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:14:14,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:14:14,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:14:14,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_25-model_00-model_states.pt... 0: [2022-11-26 19:14:14,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:14:14,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 19:14:14,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:14:14,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:14:14,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:14:14,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:14:14,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:14:14,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_29-model_00-model_states.pt... 0: [2022-11-26 19:14:15,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_29-model_00-model_states.pt. 0: [2022-11-26 19:14:15,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:14:15,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:14:15,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_31-model_00-model_states.pt... 0: [2022-11-26 19:14:15,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_31-model_00-model_states.pt. 0: [2022-11-26 19:14:15,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 19:14:15,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_32-model_00-model_states.pt. 0: [2022-11-26 19:14:15,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_33-model_00-model_states.pt... 0: [2022-11-26 19:14:15,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_33-model_00-model_states.pt. 0: [2022-11-26 19:14:15,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_34-model_00-model_states.pt... 0: [2022-11-26 19:14:15,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_34-model_00-model_states.pt. 0: [2022-11-26 19:14:15,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/layer_36-model_00-model_states.pt... 0: [2022-11-26 19:14:15,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/layer_36-model_00-model_states.pt. 0: [2022-11-26 19:14:15,700] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step28000/mp_rank_00_model_states.pt 0: [2022-11-26 19:14:15,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:14:15,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/mp_rank_00_model_states.pt. 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:14:15,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:14:16,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:14:16,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:14:16,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:14:16,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
6: [2022-11-26 19:14:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:14:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
2: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:14:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:14:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
2: [2022-11-26 19:14:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:14:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:14:16,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 19:14:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:14:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 19:14:16,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:14:16,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:14:16,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:14:16,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 19:14:16,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:14:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 19:14:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:14:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:14:16,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 19:14:16,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 19:14:16,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:14:16,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:14:16,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 19:14:16,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:14:16,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:14:16,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 19:14:16,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: successfully saved checkpoint at iteration 28000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6010.68 7: iteration 28010/ 44073 | consumed samples: 14341120 | consumed tokens: 29370613760 | elapsed time per iteration (s): 4.89 | learning rate: 7.378E-05 | global batch size: 512 | lm loss: 1.979453E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.711 | TFLOPs: 48.80 | 7: iteration 28020/ 44073 | consumed samples: 14346240 | consumed tokens: 29381099520 | elapsed time per iteration (s): 4.15 | learning rate: 7.372E-05 | global batch size: 512 | lm loss: 2.012879E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.286 | TFLOPs: 57.46 | 7: iteration 28030/ 44073 | consumed samples: 14351360 | consumed tokens: 29391585280 | elapsed time per iteration (s): 4.18 | learning rate: 7.366E-05 | global batch size: 512 | lm loss: 1.937394E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.470 | TFLOPs: 57.08 | 7: iteration 28040/ 44073 | consumed samples: 14356480 | consumed tokens: 29402071040 | elapsed time per iteration (s): 4.15 | learning rate: 7.360E-05 | global batch size: 512 | lm loss: 1.984395E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.243 | TFLOPs: 57.44 | 7: iteration 28050/ 44073 | consumed samples: 14361600 | consumed tokens: 29412556800 | elapsed time per iteration (s): 4.15 | learning rate: 7.354E-05 | global batch size: 512 | lm loss: 1.979488E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.274 | TFLOPs: 57.45 | 7: iteration 28060/ 44073 | consumed samples: 14366720 | consumed tokens: 29423042560 | elapsed time per iteration (s): 4.15 | learning rate: 
7.348E-05 | global batch size: 512 | lm loss: 1.979824E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.453 | TFLOPs: 57.54 | 7: iteration 28070/ 44073 | consumed samples: 14371840 | consumed tokens: 29433528320 | elapsed time per iteration (s): 4.17 | learning rate: 7.342E-05 | global batch size: 512 | lm loss: 1.982081E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.731 | TFLOPs: 57.20 | 7: iteration 28080/ 44073 | consumed samples: 14376960 | consumed tokens: 29444014080 | elapsed time per iteration (s): 4.14 | learning rate: 7.336E-05 | global batch size: 512 | lm loss: 1.987398E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.577 | TFLOPs: 57.59 | 7: iteration 28090/ 44073 | consumed samples: 14382080 | consumed tokens: 29454499840 | elapsed time per iteration (s): 4.15 | learning rate: 7.330E-05 | global batch size: 512 | lm loss: 1.976099E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.496 | TFLOPs: 57.56 | 7: iteration 28100/ 44073 | consumed samples: 14387200 | consumed tokens: 29464985600 | elapsed time per iteration (s): 4.17 | learning rate: 7.325E-05 | global batch size: 512 | lm loss: 1.953134E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.899 | TFLOPs: 57.28 | 7: iteration 28110/ 44073 | consumed samples: 14392320 | consumed tokens: 29475471360 | elapsed time per iteration (s): 4.16 | learning rate: 7.319E-05 | global batch size: 512 | lm loss: 1.962701E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.010 | TFLOPs: 57.33 | 7: iteration 28120/ 44073 | consumed samples: 
14397440 | consumed tokens: 29485957120 | elapsed time per iteration (s): 4.17 | learning rate: 7.313E-05 | global batch size: 512 | lm loss: 1.980745E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.888 | TFLOPs: 57.27 | 7: iteration 28130/ 44073 | consumed samples: 14402560 | consumed tokens: 29496442880 | elapsed time per iteration (s): 4.64 | learning rate: 7.307E-05 | global batch size: 512 | lm loss: 1.972290E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 110.291 | TFLOPs: 51.40 | 7: iteration 28140/ 44073 | consumed samples: 14407680 | consumed tokens: 29506928640 | elapsed time per iteration (s): 4.14 | learning rate: 7.301E-05 | global batch size: 512 | lm loss: 1.972601E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.813 | TFLOPs: 57.70 | 7: iteration 28150/ 44073 | consumed samples: 14412800 | consumed tokens: 29517414400 | elapsed time per iteration (s): 4.17 | learning rate: 7.295E-05 | global batch size: 512 | lm loss: 1.980842E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.752 | TFLOPs: 57.21 | 7: iteration 28160/ 44073 | consumed samples: 14417920 | consumed tokens: 29527900160 | elapsed time per iteration (s): 4.16 | learning rate: 7.289E-05 | global batch size: 512 | lm loss: 1.978274E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.988 | TFLOPs: 57.32 | 7: iteration 28170/ 44073 | consumed samples: 14423040 | consumed tokens: 29538385920 | elapsed time per iteration (s): 4.14 | learning rate: 7.283E-05 | global batch size: 512 | lm loss: 1.966549E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.733 | TFLOPs: 57.67 | 7: iteration 28180/ 44073 | consumed samples: 14428160 | consumed tokens: 29548871680 | elapsed time per iteration (s): 4.18 | learning rate: 7.277E-05 | global batch size: 512 | lm loss: 1.986002E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.418 | TFLOPs: 57.05 | 7: iteration 28190/ 44073 | consumed samples: 14433280 | consumed tokens: 29559357440 | elapsed time per iteration (s): 4.17 | learning rate: 7.271E-05 | global batch size: 512 | lm loss: 1.972129E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.840 | TFLOPs: 57.25 | 7: iteration 28200/ 44073 | consumed samples: 14438400 | consumed tokens: 29569843200 | elapsed time per iteration (s): 4.16 | learning rate: 7.265E-05 | global batch size: 512 | lm loss: 1.972250E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.150 | TFLOPs: 57.39 | 7: iteration 28210/ 44073 | consumed samples: 14443520 | consumed tokens: 29580328960 | elapsed time per iteration (s): 4.19 | learning rate: 7.260E-05 | global batch size: 512 | lm loss: 1.983513E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.239 | TFLOPs: 56.97 | 7: iteration 28220/ 44073 | consumed samples: 14448640 | consumed tokens: 29590814720 | elapsed time per iteration (s): 4.31 | learning rate: 7.254E-05 | global batch size: 512 | lm loss: 1.967538E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.909 | TFLOPs: 55.42 | 7: iteration 28230/ 44073 | consumed samples: 14453760 | consumed tokens: 29601300480 | elapsed time per iteration (s): 4.15 | learning rate: 7.248E-05 | global batch size: 512 | lm loss: 1.954959E+00 | 
grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.517 | TFLOPs: 57.57 | 7: iteration 28240/ 44073 | consumed samples: 14458880 | consumed tokens: 29611786240 | elapsed time per iteration (s): 4.15 | learning rate: 7.242E-05 | global batch size: 512 | lm loss: 1.968194E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.308 | TFLOPs: 57.47 | 7: iteration 28250/ 44073 | consumed samples: 14464000 | consumed tokens: 29622272000 | elapsed time per iteration (s): 4.34 | learning rate: 7.236E-05 | global batch size: 512 | lm loss: 1.971076E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.852 | TFLOPs: 54.92 | 7: iteration 28260/ 44073 | consumed samples: 14469120 | consumed tokens: 29632757760 | elapsed time per iteration (s): 4.24 | learning rate: 7.230E-05 | global batch size: 512 | lm loss: 1.978252E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.621 | TFLOPs: 56.22 | 7: iteration 28270/ 44073 | consumed samples: 14474240 | consumed tokens: 29643243520 | elapsed time per iteration (s): 4.17 | learning rate: 7.224E-05 | global batch size: 512 | lm loss: 1.974745E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.870 | TFLOPs: 57.26 | 7: iteration 28280/ 44073 | consumed samples: 14479360 | consumed tokens: 29653729280 | elapsed time per iteration (s): 4.17 | learning rate: 7.218E-05 | global batch size: 512 | lm loss: 1.974515E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.727 | TFLOPs: 57.20 | 7: iteration 28290/ 44073 | consumed samples: 14484480 | consumed tokens: 29664215040 | elapsed time per 
iteration (s): 4.30 | learning rate: 7.212E-05 | global batch size: 512 | lm loss: 1.982672E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.073 | TFLOPs: 55.49 | 7: iteration 28300/ 44073 | consumed samples: 14489600 | consumed tokens: 29674700800 | elapsed time per iteration (s): 4.15 | learning rate: 7.207E-05 | global batch size: 512 | lm loss: 1.981456E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.513 | TFLOPs: 57.56 | 7: iteration 28310/ 44073 | consumed samples: 14494720 | consumed tokens: 29685186560 | elapsed time per iteration (s): 4.16 | learning rate: 7.201E-05 | global batch size: 512 | lm loss: 1.967707E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.962 | TFLOPs: 57.31 | 7: iteration 28320/ 44073 | consumed samples: 14499840 | consumed tokens: 29695672320 | elapsed time per iteration (s): 4.14 | learning rate: 7.195E-05 | global batch size: 512 | lm loss: 1.971397E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.612 | TFLOPs: 57.61 | 7: iteration 28330/ 44073 | consumed samples: 14504960 | consumed tokens: 29706158080 | elapsed time per iteration (s): 4.14 | learning rate: 7.189E-05 | global batch size: 512 | lm loss: 1.985871E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.590 | TFLOPs: 57.60 | 7: iteration 28340/ 44073 | consumed samples: 14510080 | consumed tokens: 29716643840 | elapsed time per iteration (s): 4.13 | learning rate: 7.183E-05 | global batch size: 512 | lm loss: 1.970674E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.835 | TFLOPs: 57.71 | 7: 
iteration 28350/ 44073 | consumed samples: 14515200 | consumed tokens: 29727129600 | elapsed time per iteration (s): 4.15 | learning rate: 7.177E-05 | global batch size: 512 | lm loss: 1.977929E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.510 | TFLOPs: 57.56 | 7: iteration 28360/ 44073 | consumed samples: 14520320 | consumed tokens: 29737615360 | elapsed time per iteration (s): 4.15 | learning rate: 7.171E-05 | global batch size: 512 | lm loss: 1.969471E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.351 | TFLOPs: 57.49 | 7: iteration 28370/ 44073 | consumed samples: 14525440 | consumed tokens: 29748101120 | elapsed time per iteration (s): 4.17 | learning rate: 7.166E-05 | global batch size: 512 | lm loss: 1.972321E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.637 | TFLOPs: 57.16 | 7: iteration 28380/ 44073 | consumed samples: 14530560 | consumed tokens: 29758586880 | elapsed time per iteration (s): 4.15 | learning rate: 7.160E-05 | global batch size: 512 | lm loss: 1.950489E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.428 | TFLOPs: 57.52 | 7: iteration 28390/ 44073 | consumed samples: 14535680 | consumed tokens: 29769072640 | elapsed time per iteration (s): 4.15 | learning rate: 7.154E-05 | global batch size: 512 | lm loss: 1.973838E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.413 | TFLOPs: 57.52 | 7: iteration 28400/ 44073 | consumed samples: 14540800 | consumed tokens: 29779558400 | elapsed time per iteration (s): 4.14 | learning rate: 7.148E-05 | global batch size: 512 | lm loss: 1.981274E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.819 | TFLOPs: 57.71 | 7: iteration 28410/ 44073 | consumed samples: 14545920 | consumed tokens: 29790044160 | elapsed time per iteration (s): 4.18 | learning rate: 7.142E-05 | global batch size: 512 | lm loss: 1.945971E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.345 | TFLOPs: 57.02 | 7: iteration 28420/ 44073 | consumed samples: 14551040 | consumed tokens: 29800529920 | elapsed time per iteration (s): 4.36 | learning rate: 7.136E-05 | global batch size: 512 | lm loss: 1.967292E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.393 | TFLOPs: 54.71 | 7: iteration 28430/ 44073 | consumed samples: 14556160 | consumed tokens: 29811015680 | elapsed time per iteration (s): 4.18 | learning rate: 7.130E-05 | global batch size: 512 | lm loss: 1.973175E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.389 | TFLOPs: 57.04 | 7: iteration 28440/ 44073 | consumed samples: 14561280 | consumed tokens: 29821501440 | elapsed time per iteration (s): 4.18 | learning rate: 7.125E-05 | global batch size: 512 | lm loss: 1.959702E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.497 | TFLOPs: 57.09 | 7: iteration 28450/ 44073 | consumed samples: 14566400 | consumed tokens: 29831987200 | elapsed time per iteration (s): 4.16 | learning rate: 7.119E-05 | global batch size: 512 | lm loss: 1.957800E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.026 | TFLOPs: 57.34 | 7: iteration 28460/ 44073 | consumed samples: 14571520 | consumed tokens: 29842472960 | elapsed time per iteration (s): 4.14 | learning rate: 7.113E-05 | global 
batch size: 512 | lm loss: 1.989369E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.671 | TFLOPs: 57.64 | 7: iteration 28470/ 44073 | consumed samples: 14576640 | consumed tokens: 29852958720 | elapsed time per iteration (s): 4.14 | learning rate: 7.107E-05 | global batch size: 512 | lm loss: 1.975461E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.674 | TFLOPs: 57.64 | 7: iteration 28480/ 44073 | consumed samples: 14581760 | consumed tokens: 29863444480 | elapsed time per iteration (s): 4.15 | learning rate: 7.101E-05 | global batch size: 512 | lm loss: 1.956254E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.402 | TFLOPs: 57.51 | 7: iteration 28490/ 44073 | consumed samples: 14586880 | consumed tokens: 29873930240 | elapsed time per iteration (s): 4.15 | learning rate: 7.095E-05 | global batch size: 512 | lm loss: 1.972934E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.367 | TFLOPs: 57.50 | 7: iteration 28500/ 44073 | consumed samples: 14592000 | consumed tokens: 29884416000 | elapsed time per iteration (s): 4.18 | learning rate: 7.089E-05 | global batch size: 512 | lm loss: 1.968170E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.496 | TFLOPs: 57.09 | 7: iteration 28510/ 44073 | consumed samples: 14597120 | consumed tokens: 29894901760 | elapsed time per iteration (s): 4.17 | learning rate: 7.084E-05 | global batch size: 512 | lm loss: 1.967880E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.854 | TFLOPs: 57.26 | 7: iteration 28520/ 44073 | consumed samples: 14602240 | consumed 
tokens: 29905387520 | elapsed time per iteration (s): 4.14 | learning rate: 7.078E-05 | global batch size: 512 | lm loss: 1.974221E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.801 | TFLOPs: 57.70 | 7: iteration 28530/ 44073 | consumed samples: 14607360 | consumed tokens: 29915873280 | elapsed time per iteration (s): 4.17 | learning rate: 7.072E-05 | global batch size: 512 | lm loss: 1.946306E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.926 | TFLOPs: 57.29 | 7: iteration 28540/ 44073 | consumed samples: 14612480 | consumed tokens: 29926359040 | elapsed time per iteration (s): 4.20 | learning rate: 7.066E-05 | global batch size: 512 | lm loss: 1.971101E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.864 | TFLOPs: 56.79 | 7: iteration 28550/ 44073 | consumed samples: 14617600 | consumed tokens: 29936844800 | elapsed time per iteration (s): 4.16 | learning rate: 7.060E-05 | global batch size: 512 | lm loss: 1.976933E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.957 | TFLOPs: 57.30 | 7: iteration 28560/ 44073 | consumed samples: 14622720 | consumed tokens: 29947330560 | elapsed time per iteration (s): 4.15 | learning rate: 7.054E-05 | global batch size: 512 | lm loss: 1.969592E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.420 | TFLOPs: 57.52 | 7: iteration 28570/ 44073 | consumed samples: 14627840 | consumed tokens: 29957816320 | elapsed time per iteration (s): 4.18 | learning rate: 7.049E-05 | global batch size: 512 | lm loss: 1.962351E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 122.453 | TFLOPs: 57.07 | 7: iteration 28580/ 44073 | consumed samples: 14632960 | consumed tokens: 29968302080 | elapsed time per iteration (s): 4.18 | learning rate: 7.043E-05 | global batch size: 512 | lm loss: 1.968620E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.584 | TFLOPs: 57.13 | 7: iteration 28590/ 44073 | consumed samples: 14638080 | consumed tokens: 29978787840 | elapsed time per iteration (s): 4.15 | learning rate: 7.037E-05 | global batch size: 512 | lm loss: 1.980166E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.440 | TFLOPs: 57.53 | 7: iteration 28600/ 44073 | consumed samples: 14643200 | consumed tokens: 29989273600 | elapsed time per iteration (s): 4.17 | learning rate: 7.031E-05 | global batch size: 512 | lm loss: 1.966474E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.918 | TFLOPs: 57.29 | 7: iteration 28610/ 44073 | consumed samples: 14648320 | consumed tokens: 29999759360 | elapsed time per iteration (s): 4.16 | learning rate: 7.025E-05 | global batch size: 512 | lm loss: 1.964546E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.207 | TFLOPs: 57.42 | 7: iteration 28620/ 44073 | consumed samples: 14653440 | consumed tokens: 30010245120 | elapsed time per iteration (s): 4.16 | learning rate: 7.020E-05 | global batch size: 512 | lm loss: 1.994670E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.222 | TFLOPs: 57.43 | 7: iteration 28630/ 44073 | consumed samples: 14658560 | consumed tokens: 30020730880 | elapsed time per iteration (s): 4.16 | learning rate: 7.014E-05 | global batch size: 512 | lm loss: 1.993749E+00 | grad norm: 0.127 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.061 | TFLOPs: 57.35 | 7: iteration 28640/ 44073 | consumed samples: 14663680 | consumed tokens: 30031216640 | elapsed time per iteration (s): 4.18 | learning rate: 7.008E-05 | global batch size: 512 | lm loss: 1.932542E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.534 | TFLOPs: 57.11 | 7: iteration 28650/ 44073 | consumed samples: 14668800 | consumed tokens: 30041702400 | elapsed time per iteration (s): 4.14 | learning rate: 7.002E-05 | global batch size: 512 | lm loss: 1.968448E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.665 | TFLOPs: 57.63 | 7: iteration 28660/ 44073 | consumed samples: 14673920 | consumed tokens: 30052188160 | elapsed time per iteration (s): 4.14 | learning rate: 6.996E-05 | global batch size: 512 | lm loss: 1.980124E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.653 | TFLOPs: 57.63 | 7: iteration 28670/ 44073 | consumed samples: 14679040 | consumed tokens: 30062673920 | elapsed time per iteration (s): 4.17 | learning rate: 6.991E-05 | global batch size: 512 | lm loss: 1.953336E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.759 | TFLOPs: 57.21 | 7: iteration 28680/ 44073 | consumed samples: 14684160 | consumed tokens: 30073159680 | elapsed time per iteration (s): 4.18 | learning rate: 6.985E-05 | global batch size: 512 | lm loss: 1.953868E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.574 | TFLOPs: 57.13 | 7: iteration 28690/ 44073 | consumed samples: 14689280 | consumed tokens: 30083645440 | elapsed time per iteration (s): 4.18 
| learning rate: 6.979E-05 | global batch size: 512 | lm loss: 1.982662E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.436 | TFLOPs: 57.06 | 7: iteration 28700/ 44073 | consumed samples: 14694400 | consumed tokens: 30094131200 | elapsed time per iteration (s): 4.16 | learning rate: 6.973E-05 | global batch size: 512 | lm loss: 1.971819E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.144 | TFLOPs: 57.39 | 7: iteration 28710/ 44073 | consumed samples: 14699520 | consumed tokens: 30104616960 | elapsed time per iteration (s): 4.17 | learning rate: 6.967E-05 | global batch size: 512 | lm loss: 1.987150E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.832 | TFLOPs: 57.25 | 7: iteration 28720/ 44073 | consumed samples: 14704640 | consumed tokens: 30115102720 | elapsed time per iteration (s): 4.14 | learning rate: 6.962E-05 | global batch size: 512 | lm loss: 1.954501E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.638 | TFLOPs: 57.62 | 7: iteration 28730/ 44073 | consumed samples: 14709760 | consumed tokens: 30125588480 | elapsed time per iteration (s): 4.15 | learning rate: 6.956E-05 | global batch size: 512 | lm loss: 1.971795E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.267 | TFLOPs: 57.45 | 7: iteration 28740/ 44073 | consumed samples: 14714880 | consumed tokens: 30136074240 | elapsed time per iteration (s): 4.15 | learning rate: 6.950E-05 | global batch size: 512 | lm loss: 1.964979E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.392 | TFLOPs: 57.51 | 7: iteration 28750/ 44073 | 
consumed samples: 14720000 | consumed tokens: 30146560000 | elapsed time per iteration (s): 4.14 | learning rate: 6.944E-05 | global batch size: 512 | lm loss: 1.986868E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.557 | TFLOPs: 57.58 | 7: iteration 28760/ 44073 | consumed samples: 14725120 | consumed tokens: 30157045760 | elapsed time per iteration (s): 4.15 | learning rate: 6.938E-05 | global batch size: 512 | lm loss: 1.972747E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.392 | TFLOPs: 57.51 | 7: iteration 28770/ 44073 | consumed samples: 14730240 | consumed tokens: 30167531520 | elapsed time per iteration (s): 4.17 | learning rate: 6.933E-05 | global batch size: 512 | lm loss: 1.984119E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.674 | TFLOPs: 57.17 | 7: iteration 28780/ 44073 | consumed samples: 14735360 | consumed tokens: 30178017280 | elapsed time per iteration (s): 4.13 | learning rate: 6.927E-05 | global batch size: 512 | lm loss: 1.961656E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.824 | TFLOPs: 57.71 | 7: iteration 28790/ 44073 | consumed samples: 14740480 | consumed tokens: 30188503040 | elapsed time per iteration (s): 4.17 | learning rate: 6.921E-05 | global batch size: 512 | lm loss: 1.976610E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.916 | TFLOPs: 57.28 | 7: iteration 28800/ 44073 | consumed samples: 14745600 | consumed tokens: 30198988800 | elapsed time per iteration (s): 4.15 | learning rate: 6.915E-05 | global batch size: 512 | lm loss: 1.967495E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.493 | TFLOPs: 57.55 | 7: iteration 28810/ 44073 | consumed samples: 14750720 | consumed tokens: 30209474560 | elapsed time per iteration (s): 4.19 | learning rate: 6.910E-05 | global batch size: 512 | lm loss: 1.961610E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.058 | TFLOPs: 56.89 | 7: iteration 28820/ 44073 | consumed samples: 14755840 | consumed tokens: 30219960320 | elapsed time per iteration (s): 4.17 | learning rate: 6.904E-05 | global batch size: 512 | lm loss: 1.963165E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.727 | TFLOPs: 57.20 | 7: iteration 28830/ 44073 | consumed samples: 14760960 | consumed tokens: 30230446080 | elapsed time per iteration (s): 4.18 | learning rate: 6.898E-05 | global batch size: 512 | lm loss: 1.963896E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.622 | TFLOPs: 57.15 | 7: iteration 28840/ 44073 | consumed samples: 14766080 | consumed tokens: 30240931840 | elapsed time per iteration (s): 4.13 | learning rate: 6.892E-05 | global batch size: 512 | lm loss: 1.961430E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.951 | TFLOPs: 57.77 | 7: iteration 28850/ 44073 | consumed samples: 14771200 | consumed tokens: 30251417600 | elapsed time per iteration (s): 4.17 | learning rate: 6.886E-05 | global batch size: 512 | lm loss: 1.992242E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.652 | TFLOPs: 57.16 | 7: iteration 28860/ 44073 | consumed samples: 14776320 | consumed tokens: 30261903360 | elapsed time per iteration (s): 4.14 | learning rate: 6.881E-05 | global batch size: 512 | lm loss: 
1.969236E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.535 | TFLOPs: 57.57 | 7: iteration 28870/ 44073 | consumed samples: 14781440 | consumed tokens: 30272389120 | elapsed time per iteration (s): 4.15 | learning rate: 6.875E-05 | global batch size: 512 | lm loss: 1.958941E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.363 | TFLOPs: 57.49 | 7: iteration 28880/ 44073 | consumed samples: 14786560 | consumed tokens: 30282874880 | elapsed time per iteration (s): 4.15 | learning rate: 6.869E-05 | global batch size: 512 | lm loss: 1.979016E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.239 | TFLOPs: 57.44 | 7: iteration 28890/ 44073 | consumed samples: 14791680 | consumed tokens: 30293360640 | elapsed time per iteration (s): 4.18 | learning rate: 6.863E-05 | global batch size: 512 | lm loss: 1.956245E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.359 | TFLOPs: 57.03 | 7: iteration 28900/ 44073 | consumed samples: 14796800 | consumed tokens: 30303846400 | elapsed time per iteration (s): 4.17 | learning rate: 6.858E-05 | global batch size: 512 | lm loss: 1.976133E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.854 | TFLOPs: 57.26 | 7: iteration 28910/ 44073 | consumed samples: 14801920 | consumed tokens: 30314332160 | elapsed time per iteration (s): 4.18 | learning rate: 6.852E-05 | global batch size: 512 | lm loss: 1.965878E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.584 | TFLOPs: 57.13 | 7: iteration 28920/ 44073 | consumed samples: 14807040 | consumed tokens: 30324817920 | 
elapsed time per iteration (s): 4.16 | learning rate: 6.846E-05 | global batch size: 512 | lm loss: 1.948176E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.195 | TFLOPs: 57.42 | 7: iteration 28930/ 44073 | consumed samples: 14812160 | consumed tokens: 30335303680 | elapsed time per iteration (s): 4.13 | learning rate: 6.840E-05 | global batch size: 512 | lm loss: 1.969419E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.903 | TFLOPs: 57.75 | 7: iteration 28940/ 44073 | consumed samples: 14817280 | consumed tokens: 30345789440 | elapsed time per iteration (s): 4.15 | learning rate: 6.835E-05 | global batch size: 512 | lm loss: 1.967262E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.450 | TFLOPs: 57.53 | 7: iteration 28950/ 44073 | consumed samples: 14822400 | consumed tokens: 30356275200 | elapsed time per iteration (s): 4.23 | learning rate: 6.829E-05 | global batch size: 512 | lm loss: 1.958468E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.168 | TFLOPs: 56.47 | 7: iteration 28960/ 44073 | consumed samples: 14827520 | consumed tokens: 30366760960 | elapsed time per iteration (s): 4.19 | learning rate: 6.823E-05 | global batch size: 512 | lm loss: 1.968044E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.340 | TFLOPs: 57.02 | 7: iteration 28970/ 44073 | consumed samples: 14832640 | consumed tokens: 30377246720 | elapsed time per iteration (s): 4.13 | learning rate: 6.817E-05 | global batch size: 512 | lm loss: 1.973778E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.849 | TFLOPs: 
57.72 | 7: iteration 28980/ 44073 | consumed samples: 14837760 | consumed tokens: 30387732480 | elapsed time per iteration (s): 4.14 | learning rate: 6.812E-05 | global batch size: 512 | lm loss: 1.963200E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 | 7: iteration 28990/ 44073 | consumed samples: 14842880 | consumed tokens: 30398218240 | elapsed time per iteration (s): 4.13 | learning rate: 6.806E-05 | global batch size: 512 | lm loss: 1.955418E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.934 | TFLOPs: 57.76 | 7: iteration 29000/ 44073 | consumed samples: 14848000 | consumed tokens: 30408704000 | elapsed time per iteration (s): 4.14 | learning rate: 6.800E-05 | global batch size: 512 | lm loss: 1.987097E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.699 | TFLOPs: 57.65 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 29000 | lm loss value: 1.943035E+00 | lm loss PPL: 6.979902E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 29000 to checkpoints_2b2 0: [2022-11-26 20:23:49,849] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step29000 is begin to save! 0: [2022-11-26 20:23:49,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:23:50,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:23:50,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 20:23:50,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:23:50,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:23:50,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:23:50,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:23:50,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:23:50,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:23:50,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:23:50,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:23:50,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:23:50,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:23:51,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:23:51,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 20:23:51,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:23:51,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:23:51,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:23:51,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:23:51,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:23:51,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:23:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:23:51,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:23:51,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:23:51,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:23:51,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:23:51,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 20:23:51,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:23:51,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:23:52,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:23:52,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:23:52,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:23:52,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:23:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:23:52,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:23:52,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:23:52,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:23:52,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:23:52,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 20:23:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:23:52,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:23:52,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:23:52,910] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:23:53,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:23:53,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:23:53,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:23:53,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:23:53,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:23:53,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:23:53,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:23:53,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 20:23:53,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:23:53,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:23:53,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:23:53,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_29-model_00-model_states.pt... 0: [2022-11-26 20:23:53,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_29-model_00-model_states.pt. 0: [2022-11-26 20:23:53,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:23:53,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:23:53,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_31-model_00-model_states.pt... 0: [2022-11-26 20:23:54,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_31-model_00-model_states.pt. 0: [2022-11-26 20:23:54,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_32-model_00-model_states.pt... 0: [2022-11-26 20:23:54,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_32-model_00-model_states.pt. 0: [2022-11-26 20:23:54,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 20:23:54,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_33-model_00-model_states.pt. 0: [2022-11-26 20:23:54,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_34-model_00-model_states.pt... 0: [2022-11-26 20:23:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_34-model_00-model_states.pt. 0: [2022-11-26 20:23:54,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/layer_36-model_00-model_states.pt... 0: [2022-11-26 20:23:54,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/layer_36-model_00-model_states.pt. 0: [2022-11-26 20:23:54,484] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step29000/mp_rank_00_model_states.pt 0: [2022-11-26 20:23:54,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:23:54,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:23:54,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:23:55,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
2: [2022-11-26 20:23:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
2: [2022-11-26 20:23:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:23:55,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 20:23:55,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:23:55,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:23:55,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:23:55,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:23:55,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:23:55,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:23:55,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:23:55,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:23:55,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 20:23:55,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:23:55,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:23:55,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:23:55,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 20:23:55,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:23:55,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:23:55,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:23:55,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 20:23:55,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:23:55,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:23:55,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 20:23:55,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:23:55,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:23:55,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 20:23:55,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 20:23:55,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: successfully saved checkpoint at iteration 29000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5943.41 7: iteration 29010/ 44073 | consumed samples: 14853120 | consumed tokens: 30419189760 | elapsed time per iteration (s): 4.93 | learning rate: 6.795E-05 | global batch size: 512 | lm loss: 1.933956E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.798 | TFLOPs: 48.38 | 7: iteration 29020/ 44073 | consumed samples: 14858240 | consumed tokens: 30429675520 | elapsed time per iteration (s): 4.17 | learning rate: 6.789E-05 | global batch size: 512 | lm loss: 1.961080E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.740 | TFLOPs: 57.20 | 7: iteration 29030/ 44073 | consumed samples: 14863360 | consumed tokens: 30440161280 | elapsed time per iteration (s): 4.17 | learning rate: 6.783E-05 | global batch size: 512 | lm loss: 1.980571E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.663 | TFLOPs: 57.17 | 7: iteration 29040/ 44073 | consumed samples: 14868480 | consumed tokens: 30450647040 | elapsed time per iteration (s): 4.18 | learning rate: 6.777E-05 | global batch size: 512 | lm loss: 1.959503E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.608 | TFLOPs: 57.14 | 7: iteration 29050/ 44073 | consumed samples: 14873600 | consumed tokens: 30461132800 | elapsed time per iteration 
(s): 4.19 | learning rate: 6.772E-05 | global batch size: 512 | lm loss: 1.986410E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.306 | TFLOPs: 57.00 | 7: iteration 29060/ 44073 | consumed samples: 14878720 | consumed tokens: 30471618560 | elapsed time per iteration (s): 4.17 | learning rate: 6.766E-05 | global batch size: 512 | lm loss: 1.952758E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.863 | TFLOPs: 57.26 | 7: iteration 29070/ 44073 | consumed samples: 14883840 | consumed tokens: 30482104320 | elapsed time per iteration (s): 4.16 | learning rate: 6.760E-05 | global batch size: 512 | lm loss: 1.966664E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.979 | TFLOPs: 57.31 | 7: iteration 29080/ 44073 | consumed samples: 14888960 | consumed tokens: 30492590080 | elapsed time per iteration (s): 4.17 | learning rate: 6.754E-05 | global batch size: 512 | lm loss: 1.958002E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.684 | TFLOPs: 57.18 | 7: iteration 29090/ 44073 | consumed samples: 14894080 | consumed tokens: 30503075840 | elapsed time per iteration (s): 4.21 | learning rate: 6.749E-05 | global batch size: 512 | lm loss: 1.942908E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.717 | TFLOPs: 56.73 | 7: iteration 29100/ 44073 | consumed samples: 14899200 | consumed tokens: 30513561600 | elapsed time per iteration (s): 4.17 | learning rate: 6.743E-05 | global batch size: 512 | lm loss: 1.991506E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.667 | TFLOPs: 57.17 | 7: iteration 29110/ 
44073 | consumed samples: 14904320 | consumed tokens: 30524047360 | elapsed time per iteration (s): 4.15 | learning rate: 6.737E-05 | global batch size: 512 | lm loss: 1.957357E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.258 | TFLOPs: 57.44 | 7: iteration 29120/ 44073 | consumed samples: 14909440 | consumed tokens: 30534533120 | elapsed time per iteration (s): 4.19 | learning rate: 6.732E-05 | global batch size: 512 | lm loss: 1.976914E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.240 | TFLOPs: 56.97 | 7: iteration 29130/ 44073 | consumed samples: 14914560 | consumed tokens: 30545018880 | elapsed time per iteration (s): 4.14 | learning rate: 6.726E-05 | global batch size: 512 | lm loss: 1.977303E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.813 | TFLOPs: 57.70 | 7: iteration 29140/ 44073 | consumed samples: 14919680 | consumed tokens: 30555504640 | elapsed time per iteration (s): 4.13 | learning rate: 6.720E-05 | global batch size: 512 | lm loss: 1.921681E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.825 | TFLOPs: 57.71 | 7: iteration 29150/ 44073 | consumed samples: 14924800 | consumed tokens: 30565990400 | elapsed time per iteration (s): 4.15 | learning rate: 6.715E-05 | global batch size: 512 | lm loss: 1.968987E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.334 | TFLOPs: 57.48 | 7: iteration 29160/ 44073 | consumed samples: 14929920 | consumed tokens: 30576476160 | elapsed time per iteration (s): 4.21 | learning rate: 6.709E-05 | global batch size: 512 | lm loss: 1.970447E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 121.741 | TFLOPs: 56.74 | 7: iteration 29170/ 44073 | consumed samples: 14935040 | consumed tokens: 30586961920 | elapsed time per iteration (s): 4.20 | learning rate: 6.703E-05 | global batch size: 512 | lm loss: 1.940665E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.798 | TFLOPs: 56.76 | 7: iteration 29180/ 44073 | consumed samples: 14940160 | consumed tokens: 30597447680 | elapsed time per iteration (s): 4.21 | learning rate: 6.697E-05 | global batch size: 512 | lm loss: 1.978430E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.533 | TFLOPs: 56.64 | 7: iteration 29190/ 44073 | consumed samples: 14945280 | consumed tokens: 30607933440 | elapsed time per iteration (s): 4.17 | learning rate: 6.692E-05 | global batch size: 512 | lm loss: 1.966132E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.683 | TFLOPs: 57.18 | 7: iteration 29200/ 44073 | consumed samples: 14950400 | consumed tokens: 30618419200 | elapsed time per iteration (s): 4.17 | learning rate: 6.686E-05 | global batch size: 512 | lm loss: 1.956027E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.723 | TFLOPs: 57.20 | 7: iteration 29210/ 44073 | consumed samples: 14955520 | consumed tokens: 30628904960 | elapsed time per iteration (s): 4.17 | learning rate: 6.680E-05 | global batch size: 512 | lm loss: 1.984750E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.856 | TFLOPs: 57.26 | 7: iteration 29220/ 44073 | consumed samples: 14960640 | consumed tokens: 30639390720 | elapsed time per iteration (s): 4.31 | learning rate: 6.675E-05 | global batch size: 512 | 
lm loss: 1.948863E+00 | grad norm: 0.113 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.763 | TFLOPs: 55.35 | 7: iteration 29230/ 44073 | consumed samples: 14965760 | consumed tokens: 30649876480 | elapsed time per iteration (s): 4.15 | learning rate: 6.669E-05 | global batch size: 512 | lm loss: 1.968703E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.239 | TFLOPs: 57.44 | 7: iteration 29240/ 44073 | consumed samples: 14970880 | consumed tokens: 30660362240 | elapsed time per iteration (s): 4.17 | learning rate: 6.663E-05 | global batch size: 512 | lm loss: 1.961048E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.864 | TFLOPs: 57.26 | 7: iteration 29250/ 44073 | consumed samples: 14976000 | consumed tokens: 30670848000 | elapsed time per iteration (s): 4.22 | learning rate: 6.658E-05 | global batch size: 512 | lm loss: 1.962337E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.448 | TFLOPs: 56.60 | 7: iteration 29260/ 44073 | consumed samples: 14981120 | consumed tokens: 30681333760 | elapsed time per iteration (s): 4.16 | learning rate: 6.652E-05 | global batch size: 512 | lm loss: 1.970949E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.197 | TFLOPs: 57.42 | 7: iteration 29270/ 44073 | consumed samples: 14986240 | consumed tokens: 30691819520 | elapsed time per iteration (s): 4.20 | learning rate: 6.646E-05 | global batch size: 512 | lm loss: 1.973383E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.038 | TFLOPs: 56.88 | 7: iteration 29280/ 44073 | consumed samples: 14991360 | consumed tokens: 
30702305280 | elapsed time per iteration (s): 4.24 | learning rate: 6.641E-05 | global batch size: 512 | lm loss: 1.971107E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.817 | TFLOPs: 56.31 | 7: iteration 29290/ 44073 | consumed samples: 14996480 | consumed tokens: 30712791040 | elapsed time per iteration (s): 4.14 | learning rate: 6.635E-05 | global batch size: 512 | lm loss: 1.953015E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 | 7: iteration 29300/ 44073 | consumed samples: 15001600 | consumed tokens: 30723276800 | elapsed time per iteration (s): 4.19 | learning rate: 6.629E-05 | global batch size: 512 | lm loss: 1.963406E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.068 | TFLOPs: 56.89 | 7: iteration 29310/ 44073 | consumed samples: 15006720 | consumed tokens: 30733762560 | elapsed time per iteration (s): 4.15 | learning rate: 6.624E-05 | global batch size: 512 | lm loss: 1.969220E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.329 | TFLOPs: 57.48 | 7: iteration 29320/ 44073 | consumed samples: 15011840 | consumed tokens: 30744248320 | elapsed time per iteration (s): 4.16 | learning rate: 6.618E-05 | global batch size: 512 | lm loss: 1.970108E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.083 | TFLOPs: 57.36 | 7: iteration 29330/ 44073 | consumed samples: 15016960 | consumed tokens: 30754734080 | elapsed time per iteration (s): 4.17 | learning rate: 6.612E-05 | global batch size: 512 | lm loss: 1.968237E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
122.852 | TFLOPs: 57.26 | 7: iteration 29340/ 44073 | consumed samples: 15022080 | consumed tokens: 30765219840 | elapsed time per iteration (s): 4.23 | learning rate: 6.607E-05 | global batch size: 512 | lm loss: 1.965374E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.942 | TFLOPs: 56.36 | 7: iteration 29350/ 44073 | consumed samples: 15027200 | consumed tokens: 30775705600 | elapsed time per iteration (s): 4.35 | learning rate: 6.601E-05 | global batch size: 512 | lm loss: 1.961473E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.668 | TFLOPs: 54.84 | 7: iteration 29360/ 44073 | consumed samples: 15032320 | consumed tokens: 30786191360 | elapsed time per iteration (s): 4.15 | learning rate: 6.595E-05 | global batch size: 512 | lm loss: 1.967415E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.315 | TFLOPs: 57.47 | 7: iteration 29370/ 44073 | consumed samples: 15037440 | consumed tokens: 30796677120 | elapsed time per iteration (s): 4.16 | learning rate: 6.590E-05 | global batch size: 512 | lm loss: 1.986678E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.989 | TFLOPs: 57.32 | 7: iteration 29380/ 44073 | consumed samples: 15042560 | consumed tokens: 30807162880 | elapsed time per iteration (s): 4.16 | learning rate: 6.584E-05 | global batch size: 512 | lm loss: 1.967999E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.001 | TFLOPs: 57.32 | 7: iteration 29390/ 44073 | consumed samples: 15047680 | consumed tokens: 30817648640 | elapsed time per iteration (s): 4.19 | learning rate: 6.578E-05 | global batch size: 512 | lm loss: 1.980220E+00 | grad norm: 0.128 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.171 | TFLOPs: 56.94 | 7: iteration 29400/ 44073 | consumed samples: 15052800 | consumed tokens: 30828134400 | elapsed time per iteration (s): 4.21 | learning rate: 6.573E-05 | global batch size: 512 | lm loss: 1.984575E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.494 | TFLOPs: 56.62 | 7: iteration 29410/ 44073 | consumed samples: 15057920 | consumed tokens: 30838620160 | elapsed time per iteration (s): 4.17 | learning rate: 6.567E-05 | global batch size: 512 | lm loss: 1.975785E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.926 | TFLOPs: 57.29 | 7: iteration 29420/ 44073 | consumed samples: 15063040 | consumed tokens: 30849105920 | elapsed time per iteration (s): 4.16 | learning rate: 6.561E-05 | global batch size: 512 | lm loss: 1.963879E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.161 | TFLOPs: 57.40 | 7: iteration 29430/ 44073 | consumed samples: 15068160 | consumed tokens: 30859591680 | elapsed time per iteration (s): 4.15 | learning rate: 6.556E-05 | global batch size: 512 | lm loss: 1.964019E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.390 | TFLOPs: 57.51 | 7: iteration 29440/ 44073 | consumed samples: 15073280 | consumed tokens: 30870077440 | elapsed time per iteration (s): 4.17 | learning rate: 6.550E-05 | global batch size: 512 | lm loss: 1.950627E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.774 | TFLOPs: 57.22 | 7: iteration 29450/ 44073 | consumed samples: 15078400 | consumed tokens: 30880563200 | elapsed time per iteration (s): 4.17 | 
learning rate: 6.545E-05 | global batch size: 512 | lm loss: 1.971240E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.915 | TFLOPs: 57.28 | 7: iteration 29460/ 44073 | consumed samples: 15083520 | consumed tokens: 30891048960 | elapsed time per iteration (s): 4.17 | learning rate: 6.539E-05 | global batch size: 512 | lm loss: 1.984224E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.837 | TFLOPs: 57.25 | 7: iteration 29470/ 44073 | consumed samples: 15088640 | consumed tokens: 30901534720 | elapsed time per iteration (s): 4.17 | learning rate: 6.533E-05 | global batch size: 512 | lm loss: 1.966207E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.798 | TFLOPs: 57.23 | 7: iteration 29480/ 44073 | consumed samples: 15093760 | consumed tokens: 30912020480 | elapsed time per iteration (s): 4.21 | learning rate: 6.528E-05 | global batch size: 512 | lm loss: 1.966416E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.749 | TFLOPs: 56.74 | 7: iteration 29490/ 44073 | consumed samples: 15098880 | consumed tokens: 30922506240 | elapsed time per iteration (s): 4.18 | learning rate: 6.522E-05 | global batch size: 512 | lm loss: 1.981844E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.568 | TFLOPs: 57.12 | 7: iteration 29500/ 44073 | consumed samples: 15104000 | consumed tokens: 30932992000 | elapsed time per iteration (s): 4.16 | learning rate: 6.516E-05 | global batch size: 512 | lm loss: 1.959988E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.213 | TFLOPs: 57.42 | 7: iteration 29510/ 44073 | 
consumed samples: 15109120 | consumed tokens: 30943477760 | elapsed time per iteration (s): 4.21 | learning rate: 6.511E-05 | global batch size: 512 | lm loss: 1.953880E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.748 | TFLOPs: 56.74 | 7: iteration 29520/ 44073 | consumed samples: 15114240 | consumed tokens: 30953963520 | elapsed time per iteration (s): 4.18 | learning rate: 6.505E-05 | global batch size: 512 | lm loss: 1.967291E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.588 | TFLOPs: 57.13 | 7: iteration 29530/ 44073 | consumed samples: 15119360 | consumed tokens: 30964449280 | elapsed time per iteration (s): 4.15 | learning rate: 6.500E-05 | global batch size: 512 | lm loss: 1.957024E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 | 7: iteration 29540/ 44073 | consumed samples: 15124480 | consumed tokens: 30974935040 | elapsed time per iteration (s): 4.13 | learning rate: 6.494E-05 | global batch size: 512 | lm loss: 1.978230E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.917 | TFLOPs: 57.75 | 7: iteration 29550/ 44073 | consumed samples: 15129600 | consumed tokens: 30985420800 | elapsed time per iteration (s): 4.16 | learning rate: 6.488E-05 | global batch size: 512 | lm loss: 1.960359E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.151 | TFLOPs: 57.39 | 7: iteration 29560/ 44073 | consumed samples: 15134720 | consumed tokens: 30995906560 | elapsed time per iteration (s): 4.17 | learning rate: 6.483E-05 | global batch size: 512 | lm loss: 1.974431E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 122.928 | TFLOPs: 57.29 | 7: iteration 29570/ 44073 | consumed samples: 15139840 | consumed tokens: 31006392320 | elapsed time per iteration (s): 4.19 | learning rate: 6.477E-05 | global batch size: 512 | lm loss: 1.955855E+00 | grad norm: 0.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.172 | TFLOPs: 56.94 | 7: iteration 29580/ 44073 | consumed samples: 15144960 | consumed tokens: 31016878080 | elapsed time per iteration (s): 4.17 | learning rate: 6.472E-05 | global batch size: 512 | lm loss: 1.962529E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.816 | TFLOPs: 57.24 | 7: iteration 29590/ 44073 | consumed samples: 15150080 | consumed tokens: 31027363840 | elapsed time per iteration (s): 4.15 | learning rate: 6.466E-05 | global batch size: 512 | lm loss: 1.975113E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.396 | TFLOPs: 57.51 | 7: iteration 29600/ 44073 | consumed samples: 15155200 | consumed tokens: 31037849600 | elapsed time per iteration (s): 4.16 | learning rate: 6.460E-05 | global batch size: 512 | lm loss: 1.967059E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.133 | TFLOPs: 57.39 | 7: iteration 29610/ 44073 | consumed samples: 15160320 | consumed tokens: 31048335360 | elapsed time per iteration (s): 4.18 | learning rate: 6.455E-05 | global batch size: 512 | lm loss: 1.957592E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.347 | TFLOPs: 57.02 | 7: iteration 29620/ 44073 | consumed samples: 15165440 | consumed tokens: 31058821120 | elapsed time per iteration (s): 4.15 | learning rate: 6.449E-05 | global batch size: 512 | lm loss: 
1.960184E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.375 | TFLOPs: 57.50 | 7: iteration 29630/ 44073 | consumed samples: 15170560 | consumed tokens: 31069306880 | elapsed time per iteration (s): 4.16 | learning rate: 6.444E-05 | global batch size: 512 | lm loss: 1.964840E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.179 | TFLOPs: 57.41 | 7: iteration 29640/ 44073 | consumed samples: 15175680 | consumed tokens: 31079792640 | elapsed time per iteration (s): 4.16 | learning rate: 6.438E-05 | global batch size: 512 | lm loss: 1.976989E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.143 | TFLOPs: 57.39 | 7: iteration 29650/ 44073 | consumed samples: 15180800 | consumed tokens: 31090278400 | elapsed time per iteration (s): 4.19 | learning rate: 6.432E-05 | global batch size: 512 | lm loss: 1.987894E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.248 | TFLOPs: 56.97 | 7: iteration 29660/ 44073 | consumed samples: 15185920 | consumed tokens: 31100764160 | elapsed time per iteration (s): 4.19 | learning rate: 6.427E-05 | global batch size: 512 | lm loss: 1.951848E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.222 | TFLOPs: 56.96 | 7: iteration 29670/ 44073 | consumed samples: 15191040 | consumed tokens: 31111249920 | elapsed time per iteration (s): 4.18 | learning rate: 6.421E-05 | global batch size: 512 | lm loss: 1.969354E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.412 | TFLOPs: 57.05 | 7: iteration 29680/ 44073 | consumed samples: 15196160 | consumed tokens: 31121735680 | 
elapsed time per iteration (s): 4.31 | learning rate: 6.416E-05 | global batch size: 512 | lm loss: 1.968920E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.680 | TFLOPs: 55.31 | 7: iteration 29690/ 44073 | consumed samples: 15201280 | consumed tokens: 31132221440 | elapsed time per iteration (s): 4.15 | learning rate: 6.410E-05 | global batch size: 512 | lm loss: 1.965235E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.421 | TFLOPs: 57.52 | 7: iteration 29700/ 44073 | consumed samples: 15206400 | consumed tokens: 31142707200 | elapsed time per iteration (s): 4.15 | learning rate: 6.405E-05 | global batch size: 512 | lm loss: 1.970724E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.338 | TFLOPs: 57.48 | 7: iteration 29710/ 44073 | consumed samples: 15211520 | consumed tokens: 31153192960 | elapsed time per iteration (s): 4.17 | learning rate: 6.399E-05 | global batch size: 512 | lm loss: 1.976974E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.866 | TFLOPs: 57.26 | 7: iteration 29720/ 44073 | consumed samples: 15216640 | consumed tokens: 31163678720 | elapsed time per iteration (s): 4.17 | learning rate: 6.393E-05 | global batch size: 512 | lm loss: 1.961714E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.659 | TFLOPs: 57.17 | 7: iteration 29730/ 44073 | consumed samples: 15221760 | consumed tokens: 31174164480 | elapsed time per iteration (s): 4.24 | learning rate: 6.388E-05 | global batch size: 512 | lm loss: 1.946609E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.758 | TFLOPs: 
56.28 | 7: iteration 29740/ 44073 | consumed samples: 15226880 | consumed tokens: 31184650240 | elapsed time per iteration (s): 4.20 | learning rate: 6.382E-05 | global batch size: 512 | lm loss: 1.974164E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.020 | TFLOPs: 56.87 | 7: iteration 29750/ 44073 | consumed samples: 15232000 | consumed tokens: 31195136000 | elapsed time per iteration (s): 4.14 | learning rate: 6.377E-05 | global batch size: 512 | lm loss: 1.956843E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.617 | TFLOPs: 57.61 | 7: iteration 29760/ 44073 | consumed samples: 15237120 | consumed tokens: 31205621760 | elapsed time per iteration (s): 4.17 | learning rate: 6.371E-05 | global batch size: 512 | lm loss: 1.953570E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.880 | TFLOPs: 57.27 | 7: iteration 29770/ 44073 | consumed samples: 15242240 | consumed tokens: 31216107520 | elapsed time per iteration (s): 4.17 | learning rate: 6.366E-05 | global batch size: 512 | lm loss: 1.965289E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.665 | TFLOPs: 57.17 | 7: iteration 29780/ 44073 | consumed samples: 15247360 | consumed tokens: 31226593280 | elapsed time per iteration (s): 4.18 | learning rate: 6.360E-05 | global batch size: 512 | lm loss: 1.964819E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.599 | TFLOPs: 57.14 | 7: iteration 29790/ 44073 | consumed samples: 15252480 | consumed tokens: 31237079040 | elapsed time per iteration (s): 4.14 | learning rate: 6.355E-05 | global batch size: 512 | lm loss: 1.932983E+00 | grad norm: 0.128 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.717 | TFLOPs: 57.66 | 7: iteration 29800/ 44073 | consumed samples: 15257600 | consumed tokens: 31247564800 | elapsed time per iteration (s): 4.17 | learning rate: 6.349E-05 | global batch size: 512 | lm loss: 1.975073E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.841 | TFLOPs: 57.25 | 7: iteration 29810/ 44073 | consumed samples: 15262720 | consumed tokens: 31258050560 | elapsed time per iteration (s): 4.16 | learning rate: 6.343E-05 | global batch size: 512 | lm loss: 1.957623E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.085 | TFLOPs: 57.36 | 7: iteration 29820/ 44073 | consumed samples: 15267840 | consumed tokens: 31268536320 | elapsed time per iteration (s): 4.17 | learning rate: 6.338E-05 | global batch size: 512 | lm loss: 1.972028E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.722 | TFLOPs: 57.19 | 7: iteration 29830/ 44073 | consumed samples: 15272960 | consumed tokens: 31279022080 | elapsed time per iteration (s): 4.19 | learning rate: 6.332E-05 | global batch size: 512 | lm loss: 1.968706E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.071 | TFLOPs: 56.89 | 7: iteration 29840/ 44073 | consumed samples: 15278080 | consumed tokens: 31289507840 | elapsed time per iteration (s): 4.15 | learning rate: 6.327E-05 | global batch size: 512 | lm loss: 1.961999E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.433 | TFLOPs: 57.53 | 7: iteration 29850/ 44073 | consumed samples: 15283200 | consumed tokens: 31299993600 | elapsed time per iteration (s): 4.19 | learning rate: 6.321E-05 
| global batch size: 512 | lm loss: 1.977996E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.128 | TFLOPs: 56.92 | 7: iteration 29860/ 44073 | consumed samples: 15288320 | consumed tokens: 31310479360 | elapsed time per iteration (s): 4.14 | learning rate: 6.316E-05 | global batch size: 512 | lm loss: 1.949688E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.608 | TFLOPs: 57.61 | 7: iteration 29870/ 44073 | consumed samples: 15293440 | consumed tokens: 31320965120 | elapsed time per iteration (s): 4.22 | learning rate: 6.310E-05 | global batch size: 512 | lm loss: 1.962419E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.466 | TFLOPs: 56.61 | 7: iteration 29880/ 44073 | consumed samples: 15298560 | consumed tokens: 31331450880 | elapsed time per iteration (s): 4.20 | learning rate: 6.305E-05 | global batch size: 512 | lm loss: 1.948159E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.818 | TFLOPs: 56.77 | 7: iteration 29890/ 44073 | consumed samples: 15303680 | consumed tokens: 31341936640 | elapsed time per iteration (s): 4.22 | learning rate: 6.299E-05 | global batch size: 512 | lm loss: 1.969564E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.334 | TFLOPs: 56.55 | 7: iteration 29900/ 44073 | consumed samples: 15308800 | consumed tokens: 31352422400 | elapsed time per iteration (s): 4.20 | learning rate: 6.294E-05 | global batch size: 512 | lm loss: 1.959276E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.031 | TFLOPs: 56.87 | 7: iteration 29910/ 44073 | consumed samples: 15313920 | 
consumed tokens: 31362908160 | elapsed time per iteration (s): 4.17 | learning rate: 6.288E-05 | global batch size: 512 | lm loss: 1.968406E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.846 | TFLOPs: 57.25 | 7: iteration 29920/ 44073 | consumed samples: 15319040 | consumed tokens: 31373393920 | elapsed time per iteration (s): 4.16 | learning rate: 6.283E-05 | global batch size: 512 | lm loss: 1.967321E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.143 | TFLOPs: 57.39 | 7: iteration 29930/ 44073 | consumed samples: 15324160 | consumed tokens: 31383879680 | elapsed time per iteration (s): 4.18 | learning rate: 6.277E-05 | global batch size: 512 | lm loss: 1.963783E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.424 | TFLOPs: 57.06 | 7: iteration 29940/ 44073 | consumed samples: 15329280 | consumed tokens: 31394365440 | elapsed time per iteration (s): 4.16 | learning rate: 6.272E-05 | global batch size: 512 | lm loss: 1.961763E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.142 | TFLOPs: 57.39 | 7: iteration 29950/ 44073 | consumed samples: 15334400 | consumed tokens: 31404851200 | elapsed time per iteration (s): 4.18 | learning rate: 6.266E-05 | global batch size: 512 | lm loss: 1.960074E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.570 | TFLOPs: 57.12 | 7: iteration 29960/ 44073 | consumed samples: 15339520 | consumed tokens: 31415336960 | elapsed time per iteration (s): 4.15 | learning rate: 6.261E-05 | global batch size: 512 | lm loss: 1.975249E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 123.474 | TFLOPs: 57.55 | 7: iteration 29970/ 44073 | consumed samples: 15344640 | consumed tokens: 31425822720 | elapsed time per iteration (s): 4.16 | learning rate: 6.255E-05 | global batch size: 512 | lm loss: 1.946391E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.035 | TFLOPs: 57.34 | 7: iteration 29980/ 44073 | consumed samples: 15349760 | consumed tokens: 31436308480 | elapsed time per iteration (s): 4.15 | learning rate: 6.249E-05 | global batch size: 512 | lm loss: 1.972580E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.305 | TFLOPs: 57.47 | 7: iteration 29990/ 44073 | consumed samples: 15354880 | consumed tokens: 31446794240 | elapsed time per iteration (s): 4.16 | learning rate: 6.244E-05 | global batch size: 512 | lm loss: 1.961792E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.197 | TFLOPs: 57.42 | 0: [2022-11-26 21:33:35,063] [INFO] [logging.py:68:log_dist] [Rank 0] step=30000, skipped=0, lr=[6.238496655497606e-05, 6.238496655497606e-05, 6.238496655497606e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 30000/ 44073 | consumed samples: 15360000 | consumed tokens: 31457280000 | elapsed time per iteration (s): 4.31 | learning rate: 6.238E-05 | global batch size: 512 | lm loss: 1.956298E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.764 | TFLOPs: 55.35 | 0: steps: 30000 loss: 2.0353 iter time (s): 4.172 samples/sec: 122.736 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 30000 | lm loss value: 1.917316E+00 | lm loss PPL: 6.802677E+00 | 7: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 30000 to checkpoints_2b2 0: [2022-11-26 21:33:36,396] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step30000 is begin to save! 0: [2022-11-26 21:33:36,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:33:36,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:33:36,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:33:36,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:33:36,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:33:37,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:33:37,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:33:37,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:33:37,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:33:37,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:33:37,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 21:33:37,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:33:37,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:33:37,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:33:37,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:33:37,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:33:37,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:33:37,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:33:37,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:33:37,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:33:37,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:33:38,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:33:38,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 21:33:38,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:33:38,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:33:38,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:33:38,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:33:38,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:33:38,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:33:38,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:33:38,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:33:38,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:33:38,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:33:38,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:33:38,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 21:33:38,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:33:38,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:33:39,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:33:39,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:33:39,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:33:39,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:33:39,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:33:39,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:33:39,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:33:39,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:33:39,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:33:39,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 21:33:39,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:33:39,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:33:39,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:33:39,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:33:39,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:33:39,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:33:40,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:33:40,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_29-model_00-model_states.pt... 0: [2022-11-26 21:33:40,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_29-model_00-model_states.pt. 0: [2022-11-26 21:33:40,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:33:40,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:33:40,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_31-model_00-model_states.pt... 
0: [2022-11-26 21:33:40,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_31-model_00-model_states.pt. 0: [2022-11-26 21:33:40,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_32-model_00-model_states.pt... 0: [2022-11-26 21:33:40,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_32-model_00-model_states.pt. 0: [2022-11-26 21:33:40,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_33-model_00-model_states.pt... 0: [2022-11-26 21:33:40,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_33-model_00-model_states.pt. 0: [2022-11-26 21:33:40,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_34-model_00-model_states.pt... 0: [2022-11-26 21:33:40,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_34-model_00-model_states.pt. 0: [2022-11-26 21:33:40,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/layer_36-model_00-model_states.pt... 0: [2022-11-26 21:33:40,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/layer_36-model_00-model_states.pt. 0: [2022-11-26 21:33:40,861] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step30000/mp_rank_00_model_states.pt 0: [2022-11-26 21:33:40,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:33:40,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/mp_rank_00_model_states.pt. 
0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
5: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:40,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:33:41,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:33:41,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
0: [2022-11-26 21:33:41,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
5: [2022-11-26 21:33:41,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
5: [2022-11-26 21:33:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:33:41,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:33:41,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 21:33:41,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 21:33:41,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:33:41,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 21:33:41,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:33:41,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 21:33:41,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 21:33:41,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
3: [2022-11-26 21:33:41,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:33:41,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 21:33:41,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 21:33:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 21:33:41,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 21:33:41,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 21:33:42,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:33:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:33:42,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 21:33:42,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
0: successfully saved checkpoint at iteration 30000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5764.80 7: iteration 30010/ 44073 | consumed samples: 15365120 | consumed tokens: 31467765760 | elapsed time per iteration (s): 4.83 | learning rate: 6.233E-05 | global batch size: 512 | lm loss: 1.938177E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.971 | TFLOPs: 49.39 | 7: iteration 30020/ 44073 | consumed samples: 15370240 | consumed tokens: 31478251520 | elapsed time per iteration (s): 4.15 | learning rate: 6.228E-05 | global batch size: 512 | lm loss: 1.965165E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.478 | TFLOPs: 57.55 | 7: iteration 30030/ 44073 | consumed samples: 15375360 | consumed tokens: 31488737280 | elapsed time per iteration (s): 5.56 | learning rate: 6.222E-05 | global batch size: 512 | lm loss: 1.943661E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.147 | TFLOPs: 42.95 | 7: iteration 30040/ 44073 | consumed samples: 15380480 | consumed tokens: 31499223040 | elapsed time per iteration (s): 4.15 | learning rate: 6.217E-05 | global batch size: 512 | lm loss: 1.954482E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.298 | TFLOPs: 57.46 | 7: iteration 30050/ 44073 | consumed samples: 15385600 | consumed tokens: 31509708800 | elapsed time per iteration (s): 4.13 | learning rate: 6.211E-05 | global batch size: 512 | lm loss: 1.968305E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.831 | TFLOPs: 57.71 | 7: iteration 30060/ 44073 | consumed samples: 15390720 | consumed tokens: 31520194560 | elapsed time per iteration (s): 4.18 | learning rate: 
6.206E-05 | global batch size: 512 | lm loss: 1.954334E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.560 | TFLOPs: 57.12 | 7: iteration 30070/ 44073 | consumed samples: 15395840 | consumed tokens: 31530680320 | elapsed time per iteration (s): 4.15 | learning rate: 6.200E-05 | global batch size: 512 | lm loss: 1.968336E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.482 | TFLOPs: 57.55 | 7: iteration 30080/ 44073 | consumed samples: 15400960 | consumed tokens: 31541166080 | elapsed time per iteration (s): 4.15 | learning rate: 6.195E-05 | global batch size: 512 | lm loss: 1.967270E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.332 | TFLOPs: 57.48 | 7: iteration 30090/ 44073 | consumed samples: 15406080 | consumed tokens: 31551651840 | elapsed time per iteration (s): 4.33 | learning rate: 6.189E-05 | global batch size: 512 | lm loss: 1.963336E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.111 | TFLOPs: 55.05 | 7: iteration 30100/ 44073 | consumed samples: 15411200 | consumed tokens: 31562137600 | elapsed time per iteration (s): 4.18 | learning rate: 6.184E-05 | global batch size: 512 | lm loss: 1.985270E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.397 | TFLOPs: 57.04 | 7: iteration 30110/ 44073 | consumed samples: 15416320 | consumed tokens: 31572623360 | elapsed time per iteration (s): 4.35 | learning rate: 6.178E-05 | global batch size: 512 | lm loss: 1.964718E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.619 | TFLOPs: 54.82 | 7: iteration 30120/ 44073 | consumed samples: 
15421440 | consumed tokens: 31583109120 | elapsed time per iteration (s): 4.14 | learning rate: 6.173E-05 | global batch size: 512 | lm loss: 1.949488E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.576 | TFLOPs: 57.59 | 7: iteration 30130/ 44073 | consumed samples: 15426560 | consumed tokens: 31593594880 | elapsed time per iteration (s): 4.20 | learning rate: 6.167E-05 | global batch size: 512 | lm loss: 1.951487E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.995 | TFLOPs: 56.86 | 7: iteration 30140/ 44073 | consumed samples: 15431680 | consumed tokens: 31604080640 | elapsed time per iteration (s): 4.16 | learning rate: 6.162E-05 | global batch size: 512 | lm loss: 1.964619E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.150 | TFLOPs: 57.39 | 7: iteration 30150/ 44073 | consumed samples: 15436800 | consumed tokens: 31614566400 | elapsed time per iteration (s): 4.36 | learning rate: 6.156E-05 | global batch size: 512 | lm loss: 1.957702E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.528 | TFLOPs: 54.77 | 7: iteration 30160/ 44073 | consumed samples: 15441920 | consumed tokens: 31625052160 | elapsed time per iteration (s): 4.17 | learning rate: 6.151E-05 | global batch size: 512 | lm loss: 1.946514E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.734 | TFLOPs: 57.20 | 7: iteration 30170/ 44073 | consumed samples: 15447040 | consumed tokens: 31635537920 | elapsed time per iteration (s): 4.23 | learning rate: 6.145E-05 | global batch size: 512 | lm loss: 1.970536E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 121.012 | TFLOPs: 56.40 | 7: iteration 30180/ 44073 | consumed samples: 15452160 | consumed tokens: 31646023680 | elapsed time per iteration (s): 4.22 | learning rate: 6.140E-05 | global batch size: 512 | lm loss: 1.958082E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.259 | TFLOPs: 56.51 | 7: iteration 30190/ 44073 | consumed samples: 15457280 | consumed tokens: 31656509440 | elapsed time per iteration (s): 4.20 | learning rate: 6.134E-05 | global batch size: 512 | lm loss: 1.954436E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.023 | TFLOPs: 56.87 | 7: iteration 30200/ 44073 | consumed samples: 15462400 | consumed tokens: 31666995200 | elapsed time per iteration (s): 4.15 | learning rate: 6.129E-05 | global batch size: 512 | lm loss: 1.978413E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.231 | TFLOPs: 57.43 | 7: iteration 30210/ 44073 | consumed samples: 15467520 | consumed tokens: 31677480960 | elapsed time per iteration (s): 4.19 | learning rate: 6.124E-05 | global batch size: 512 | lm loss: 1.961811E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.220 | TFLOPs: 56.96 | 7: iteration 30220/ 44073 | consumed samples: 15472640 | consumed tokens: 31687966720 | elapsed time per iteration (s): 4.21 | learning rate: 6.118E-05 | global batch size: 512 | lm loss: 1.952008E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.753 | TFLOPs: 56.74 | 7: iteration 30230/ 44073 | consumed samples: 15477760 | consumed tokens: 31698452480 | elapsed time per iteration (s): 4.19 | learning rate: 6.113E-05 | global batch size: 512 | lm loss: 1.969393E+00 | 
grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.054 | TFLOPs: 56.88 | 7: iteration 30240/ 44073 | consumed samples: 15482880 | consumed tokens: 31708938240 | elapsed time per iteration (s): 4.22 | learning rate: 6.107E-05 | global batch size: 512 | lm loss: 1.970268E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.454 | TFLOPs: 56.60 | 7: iteration 30250/ 44073 | consumed samples: 15488000 | consumed tokens: 31719424000 | elapsed time per iteration (s): 4.22 | learning rate: 6.102E-05 | global batch size: 512 | lm loss: 1.957969E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.454 | TFLOPs: 56.60 | 7: iteration 30260/ 44073 | consumed samples: 15493120 | consumed tokens: 31729909760 | elapsed time per iteration (s): 4.21 | learning rate: 6.096E-05 | global batch size: 512 | lm loss: 1.933943E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.498 | TFLOPs: 56.62 | 7: iteration 30270/ 44073 | consumed samples: 15498240 | consumed tokens: 31740395520 | elapsed time per iteration (s): 4.23 | learning rate: 6.091E-05 | global batch size: 512 | lm loss: 1.947299E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.120 | TFLOPs: 56.45 | 7: iteration 30280/ 44073 | consumed samples: 15503360 | consumed tokens: 31750881280 | elapsed time per iteration (s): 4.17 | learning rate: 6.086E-05 | global batch size: 512 | lm loss: 1.971489E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.884 | TFLOPs: 57.27 | 7: iteration 30290/ 44073 | consumed samples: 15508480 | consumed tokens: 31761367040 | elapsed time per 
iteration (s): 4.16 | learning rate: 6.080E-05 | global batch size: 512 | lm loss: 1.960799E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.100 | TFLOPs: 57.37 | 7: iteration 30300/ 44073 | consumed samples: 15513600 | consumed tokens: 31771852800 | elapsed time per iteration (s): 4.16 | learning rate: 6.075E-05 | global batch size: 512 | lm loss: 1.953205E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.178 | TFLOPs: 57.41 | 7: iteration 30310/ 44073 | consumed samples: 15518720 | consumed tokens: 31782338560 | elapsed time per iteration (s): 4.17 | learning rate: 6.069E-05 | global batch size: 512 | lm loss: 1.970954E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.841 | TFLOPs: 57.25 | 7: iteration 30320/ 44073 | consumed samples: 15523840 | consumed tokens: 31792824320 | elapsed time per iteration (s): 4.18 | learning rate: 6.064E-05 | global batch size: 512 | lm loss: 1.962038E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.461 | TFLOPs: 57.07 | 7: iteration 30330/ 44073 | consumed samples: 15528960 | consumed tokens: 31803310080 | elapsed time per iteration (s): 4.17 | learning rate: 6.058E-05 | global batch size: 512 | lm loss: 1.952964E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.720 | TFLOPs: 57.19 | 7: iteration 30340/ 44073 | consumed samples: 15534080 | consumed tokens: 31813795840 | elapsed time per iteration (s): 4.23 | learning rate: 6.053E-05 | global batch size: 512 | lm loss: 1.954219E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.051 | TFLOPs: 56.42 | 7: 
iteration 30350/ 44073 | consumed samples: 15539200 | consumed tokens: 31824281600 | elapsed time per iteration (s): 4.17 | learning rate: 6.048E-05 | global batch size: 512 | lm loss: 1.970830E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.887 | TFLOPs: 57.27 | 7: iteration 30360/ 44073 | consumed samples: 15544320 | consumed tokens: 31834767360 | elapsed time per iteration (s): 4.14 | learning rate: 6.042E-05 | global batch size: 512 | lm loss: 1.970074E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.754 | TFLOPs: 57.68 | 7: iteration 30370/ 44073 | consumed samples: 15549440 | consumed tokens: 31845253120 | elapsed time per iteration (s): 4.17 | learning rate: 6.037E-05 | global batch size: 512 | lm loss: 1.964078E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.669 | TFLOPs: 57.17 | 7: iteration 30380/ 44073 | consumed samples: 15554560 | consumed tokens: 31855738880 | elapsed time per iteration (s): 4.18 | learning rate: 6.031E-05 | global batch size: 512 | lm loss: 1.967900E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.368 | TFLOPs: 57.03 | 7: iteration 30390/ 44073 | consumed samples: 15559680 | consumed tokens: 31866224640 | elapsed time per iteration (s): 4.15 | learning rate: 6.026E-05 | global batch size: 512 | lm loss: 1.958905E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.301 | TFLOPs: 57.46 | 7: iteration 30400/ 44073 | consumed samples: 15564800 | consumed tokens: 31876710400 | elapsed time per iteration (s): 4.22 | learning rate: 6.021E-05 | global batch size: 512 | lm loss: 1.946771E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.225 | TFLOPs: 56.50 | 7: iteration 30410/ 44073 | consumed samples: 15569920 | consumed tokens: 31887196160 | elapsed time per iteration (s): 4.18 | learning rate: 6.015E-05 | global batch size: 512 | lm loss: 1.961745E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.558 | TFLOPs: 57.12 | 7: iteration 30420/ 44073 | consumed samples: 15575040 | consumed tokens: 31897681920 | elapsed time per iteration (s): 4.20 | learning rate: 6.010E-05 | global batch size: 512 | lm loss: 1.949608E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.962 | TFLOPs: 56.84 | 7: iteration 30430/ 44073 | consumed samples: 15580160 | consumed tokens: 31908167680 | elapsed time per iteration (s): 4.16 | learning rate: 6.004E-05 | global batch size: 512 | lm loss: 1.946798E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.122 | TFLOPs: 57.38 | 7: iteration 30440/ 44073 | consumed samples: 15585280 | consumed tokens: 31918653440 | elapsed time per iteration (s): 4.17 | learning rate: 5.999E-05 | global batch size: 512 | lm loss: 1.964363E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.900 | TFLOPs: 57.28 | 7: iteration 30450/ 44073 | consumed samples: 15590400 | consumed tokens: 31929139200 | elapsed time per iteration (s): 4.22 | learning rate: 5.994E-05 | global batch size: 512 | lm loss: 1.978772E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.216 | TFLOPs: 56.49 | 7: iteration 30460/ 44073 | consumed samples: 15595520 | consumed tokens: 31939624960 | elapsed time per iteration (s): 4.21 | learning rate: 5.988E-05 | global 
batch size: 512 | lm loss: 1.968924E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.754 | TFLOPs: 56.74 | 7: iteration 30470/ 44073 | consumed samples: 15600640 | consumed tokens: 31950110720 | elapsed time per iteration (s): 4.19 | learning rate: 5.983E-05 | global batch size: 512 | lm loss: 1.955316E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.178 | TFLOPs: 56.94 | 7: iteration 30480/ 44073 | consumed samples: 15605760 | consumed tokens: 31960596480 | elapsed time per iteration (s): 4.18 | learning rate: 5.977E-05 | global batch size: 512 | lm loss: 1.941804E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.515 | TFLOPs: 57.10 | 7: iteration 30490/ 44073 | consumed samples: 15610880 | consumed tokens: 31971082240 | elapsed time per iteration (s): 4.21 | learning rate: 5.972E-05 | global batch size: 512 | lm loss: 1.968144E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.576 | TFLOPs: 56.66 | 7: iteration 30500/ 44073 | consumed samples: 15616000 | consumed tokens: 31981568000 | elapsed time per iteration (s): 4.19 | learning rate: 5.967E-05 | global batch size: 512 | lm loss: 1.952435E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.144 | TFLOPs: 56.93 | 7: iteration 30510/ 44073 | consumed samples: 15621120 | consumed tokens: 31992053760 | elapsed time per iteration (s): 4.17 | learning rate: 5.961E-05 | global batch size: 512 | lm loss: 1.947410E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.777 | TFLOPs: 57.22 | 7: iteration 30520/ 44073 | consumed samples: 15626240 | consumed 
tokens: 32002539520 | elapsed time per iteration (s): 4.16 | learning rate: 5.956E-05 | global batch size: 512 | lm loss: 1.963678E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.113 | TFLOPs: 57.38 | 7: iteration 30530/ 44073 | consumed samples: 15631360 | consumed tokens: 32013025280 | elapsed time per iteration (s): 4.18 | learning rate: 5.951E-05 | global batch size: 512 | lm loss: 1.957751E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.410 | TFLOPs: 57.05 | 7: iteration 30540/ 44073 | consumed samples: 15636480 | consumed tokens: 32023511040 | elapsed time per iteration (s): 4.21 | learning rate: 5.945E-05 | global batch size: 512 | lm loss: 1.961849E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.475 | TFLOPs: 56.61 | 7: iteration 30550/ 44073 | consumed samples: 15641600 | consumed tokens: 32033996800 | elapsed time per iteration (s): 4.14 | learning rate: 5.940E-05 | global batch size: 512 | lm loss: 1.971674E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.539 | TFLOPs: 57.58 | 7: iteration 30560/ 44073 | consumed samples: 15646720 | consumed tokens: 32044482560 | elapsed time per iteration (s): 4.15 | learning rate: 5.935E-05 | global batch size: 512 | lm loss: 1.945549E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.236 | TFLOPs: 57.43 | 7: iteration 30570/ 44073 | consumed samples: 15651840 | consumed tokens: 32054968320 | elapsed time per iteration (s): 4.25 | learning rate: 5.929E-05 | global batch size: 512 | lm loss: 1.940708E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 120.597 | TFLOPs: 56.20 | 7: iteration 30580/ 44073 | consumed samples: 15656960 | consumed tokens: 32065454080 | elapsed time per iteration (s): 4.16 | learning rate: 5.924E-05 | global batch size: 512 | lm loss: 1.959471E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.979 | TFLOPs: 57.31 | 7: iteration 30590/ 44073 | consumed samples: 15662080 | consumed tokens: 32075939840 | elapsed time per iteration (s): 4.16 | learning rate: 5.918E-05 | global batch size: 512 | lm loss: 1.972356E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.040 | TFLOPs: 57.34 | 7: iteration 30600/ 44073 | consumed samples: 15667200 | consumed tokens: 32086425600 | elapsed time per iteration (s): 4.16 | learning rate: 5.913E-05 | global batch size: 512 | lm loss: 1.967440E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.945 | TFLOPs: 57.30 | 7: iteration 30610/ 44073 | consumed samples: 15672320 | consumed tokens: 32096911360 | elapsed time per iteration (s): 4.20 | learning rate: 5.908E-05 | global batch size: 512 | lm loss: 1.969746E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.911 | TFLOPs: 56.82 | 7: iteration 30620/ 44073 | consumed samples: 15677440 | consumed tokens: 32107397120 | elapsed time per iteration (s): 4.22 | learning rate: 5.902E-05 | global batch size: 512 | lm loss: 1.957972E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.279 | TFLOPs: 56.52 | 7: iteration 30630/ 44073 | consumed samples: 15682560 | consumed tokens: 32117882880 | elapsed time per iteration (s): 4.30 | learning rate: 5.897E-05 | global batch size: 512 | lm loss: 1.955925E+00 | grad norm: 0.128 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.134 | TFLOPs: 55.52 | 7: iteration 30640/ 44073 | consumed samples: 15687680 | consumed tokens: 32128368640 | elapsed time per iteration (s): 4.21 | learning rate: 5.892E-05 | global batch size: 512 | lm loss: 1.945379E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.612 | TFLOPs: 56.68 | 7: iteration 30650/ 44073 | consumed samples: 15692800 | consumed tokens: 32138854400 | elapsed time per iteration (s): 4.15 | learning rate: 5.886E-05 | global batch size: 512 | lm loss: 1.949712E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.259 | TFLOPs: 57.45 | 7: iteration 30660/ 44073 | consumed samples: 15697920 | consumed tokens: 32149340160 | elapsed time per iteration (s): 4.14 | learning rate: 5.881E-05 | global batch size: 512 | lm loss: 1.964311E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.570 | TFLOPs: 57.59 | 7: iteration 30670/ 44073 | consumed samples: 15703040 | consumed tokens: 32159825920 | elapsed time per iteration (s): 4.31 | learning rate: 5.876E-05 | global batch size: 512 | lm loss: 1.960535E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.770 | TFLOPs: 55.35 | 7: iteration 30680/ 44073 | consumed samples: 15708160 | consumed tokens: 32170311680 | elapsed time per iteration (s): 4.20 | learning rate: 5.870E-05 | global batch size: 512 | lm loss: 1.951809E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.046 | TFLOPs: 56.88 | 7: iteration 30690/ 44073 | consumed samples: 15713280 | consumed tokens: 32180797440 | elapsed time per iteration (s): 4.16 
| learning rate: 5.865E-05 | global batch size: 512 | lm loss: 1.963346E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.960 | TFLOPs: 57.31 | 7: iteration 30700/ 44073 | consumed samples: 15718400 | consumed tokens: 32191283200 | elapsed time per iteration (s): 4.17 | learning rate: 5.860E-05 | global batch size: 512 | lm loss: 1.967550E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.774 | TFLOPs: 57.22 | 7: iteration 30710/ 44073 | consumed samples: 15723520 | consumed tokens: 32201768960 | elapsed time per iteration (s): 4.17 | learning rate: 5.854E-05 | global batch size: 512 | lm loss: 1.961103E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.782 | TFLOPs: 57.22 | 7: iteration 30720/ 44073 | consumed samples: 15728640 | consumed tokens: 32212254720 | elapsed time per iteration (s): 4.17 | learning rate: 5.849E-05 | global batch size: 512 | lm loss: 1.956382E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.842 | TFLOPs: 57.25 | 7: iteration 30730/ 44073 | consumed samples: 15733760 | consumed tokens: 32222740480 | elapsed time per iteration (s): 4.21 | learning rate: 5.844E-05 | global batch size: 512 | lm loss: 1.951448E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.745 | TFLOPs: 56.74 | 7: iteration 30740/ 44073 | consumed samples: 15738880 | consumed tokens: 32233226240 | elapsed time per iteration (s): 4.17 | learning rate: 5.839E-05 | global batch size: 512 | lm loss: 1.954775E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.868 | TFLOPs: 57.26 | 7: iteration 30750/ 44073 | 
consumed samples: 15744000 | consumed tokens: 32243712000 | elapsed time per iteration (s): 4.17 | learning rate: 5.833E-05 | global batch size: 512 | lm loss: 1.935868E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.757 | TFLOPs: 57.21 | 7: iteration 30760/ 44073 | consumed samples: 15749120 | consumed tokens: 32254197760 | elapsed time per iteration (s): 4.18 | learning rate: 5.828E-05 | global batch size: 512 | lm loss: 1.948304E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.364 | TFLOPs: 57.03 | 7: iteration 30770/ 44073 | consumed samples: 15754240 | consumed tokens: 32264683520 | elapsed time per iteration (s): 4.15 | learning rate: 5.823E-05 | global batch size: 512 | lm loss: 1.971513E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.228 | TFLOPs: 57.43 | 7: iteration 30780/ 44073 | consumed samples: 15759360 | consumed tokens: 32275169280 | elapsed time per iteration (s): 4.17 | learning rate: 5.817E-05 | global batch size: 512 | lm loss: 1.942558E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.678 | TFLOPs: 57.17 | 7: iteration 30790/ 44073 | consumed samples: 15764480 | consumed tokens: 32285655040 | elapsed time per iteration (s): 4.17 | learning rate: 5.812E-05 | global batch size: 512 | lm loss: 1.959091E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.903 | TFLOPs: 57.28 | 7: iteration 30800/ 44073 | consumed samples: 15769600 | consumed tokens: 32296140800 | elapsed time per iteration (s): 4.15 | learning rate: 5.807E-05 | global batch size: 512 | lm loss: 1.977672E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.282 | TFLOPs: 57.46 | 7: iteration 30810/ 44073 | consumed samples: 15774720 | consumed tokens: 32306626560 | elapsed time per iteration (s): 4.14 | learning rate: 5.801E-05 | global batch size: 512 | lm loss: 1.937339E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.797 | TFLOPs: 57.70 | 7: iteration 30820/ 44073 | consumed samples: 15779840 | consumed tokens: 32317112320 | elapsed time per iteration (s): 4.17 | learning rate: 5.796E-05 | global batch size: 512 | lm loss: 1.964935E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.849 | TFLOPs: 57.25 | 7: iteration 30830/ 44073 | consumed samples: 15784960 | consumed tokens: 32327598080 | elapsed time per iteration (s): 4.17 | learning rate: 5.791E-05 | global batch size: 512 | lm loss: 1.961721E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.692 | TFLOPs: 57.18 | 7: iteration 30840/ 44073 | consumed samples: 15790080 | consumed tokens: 32338083840 | elapsed time per iteration (s): 4.16 | learning rate: 5.786E-05 | global batch size: 512 | lm loss: 1.950703E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.032 | TFLOPs: 57.34 | 7: iteration 30850/ 44073 | consumed samples: 15795200 | consumed tokens: 32348569600 | elapsed time per iteration (s): 4.15 | learning rate: 5.780E-05 | global batch size: 512 | lm loss: 1.957208E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 30860/ 44073 | consumed samples: 15800320 | consumed tokens: 32359055360 | elapsed time per iteration (s): 4.14 | learning rate: 5.775E-05 | global batch size: 512 | lm loss: 
1.950839E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.580 | TFLOPs: 57.59 | 7: iteration 30870/ 44073 | consumed samples: 15805440 | consumed tokens: 32369541120 | elapsed time per iteration (s): 4.22 | learning rate: 5.770E-05 | global batch size: 512 | lm loss: 1.959754E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.458 | TFLOPs: 56.61 | 7: iteration 30880/ 44073 | consumed samples: 15810560 | consumed tokens: 32380026880 | elapsed time per iteration (s): 4.18 | learning rate: 5.764E-05 | global batch size: 512 | lm loss: 1.962169E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.384 | TFLOPs: 57.04 | 7: iteration 30890/ 44073 | consumed samples: 15815680 | consumed tokens: 32390512640 | elapsed time per iteration (s): 4.17 | learning rate: 5.759E-05 | global batch size: 512 | lm loss: 1.960868E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.882 | TFLOPs: 57.27 | 7: iteration 30900/ 44073 | consumed samples: 15820800 | consumed tokens: 32400998400 | elapsed time per iteration (s): 4.15 | learning rate: 5.754E-05 | global batch size: 512 | lm loss: 1.956603E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.383 | TFLOPs: 57.50 | 7: iteration 30910/ 44073 | consumed samples: 15825920 | consumed tokens: 32411484160 | elapsed time per iteration (s): 4.18 | learning rate: 5.749E-05 | global batch size: 512 | lm loss: 1.956125E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.633 | TFLOPs: 57.15 | 7: iteration 30920/ 44073 | consumed samples: 15831040 | consumed tokens: 32421969920 | 
elapsed time per iteration (s): 4.14 | learning rate: 5.743E-05 | global batch size: 512 | lm loss: 1.960615E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.763 | TFLOPs: 57.68 | 7: iteration 30930/ 44073 | consumed samples: 15836160 | consumed tokens: 32432455680 | elapsed time per iteration (s): 4.18 | learning rate: 5.738E-05 | global batch size: 512 | lm loss: 1.958790E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.503 | TFLOPs: 57.09 | 7: iteration 30940/ 44073 | consumed samples: 15841280 | consumed tokens: 32442941440 | elapsed time per iteration (s): 4.24 | learning rate: 5.733E-05 | global batch size: 512 | lm loss: 1.970462E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.752 | TFLOPs: 56.28 | 7: iteration 30950/ 44073 | consumed samples: 15846400 | consumed tokens: 32453427200 | elapsed time per iteration (s): 4.14 | learning rate: 5.728E-05 | global batch size: 512 | lm loss: 1.973433E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.638 | TFLOPs: 57.62 | 7: iteration 30960/ 44073 | consumed samples: 15851520 | consumed tokens: 32463912960 | elapsed time per iteration (s): 4.15 | learning rate: 5.722E-05 | global batch size: 512 | lm loss: 1.945096E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.315 | TFLOPs: 57.47 | 7: iteration 30970/ 44073 | consumed samples: 15856640 | consumed tokens: 32474398720 | elapsed time per iteration (s): 4.17 | learning rate: 5.717E-05 | global batch size: 512 | lm loss: 1.960462E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.857 | TFLOPs: 
57.26 | 7: iteration 30980/ 44073 | consumed samples: 15861760 | consumed tokens: 32484884480 | elapsed time per iteration (s): 4.15 | learning rate: 5.712E-05 | global batch size: 512 | lm loss: 1.963045E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.227 | TFLOPs: 57.43 | 7: iteration 30990/ 44073 | consumed samples: 15866880 | consumed tokens: 32495370240 | elapsed time per iteration (s): 4.16 | learning rate: 5.707E-05 | global batch size: 512 | lm loss: 1.966776E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.941 | TFLOPs: 57.30 | 7: iteration 31000/ 44073 | consumed samples: 15872000 | consumed tokens: 32505856000 | elapsed time per iteration (s): 4.16 | learning rate: 5.701E-05 | global batch size: 512 | lm loss: 1.949697E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.966 | TFLOPs: 57.31 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 31000 | lm loss value: 1.946021E+00 | lm loss PPL: 7.000777E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 31000 to checkpoints_2b2 0: [2022-11-26 22:43:41,367] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step31000 is begin to save! 0: [2022-11-26 22:43:41,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:43:41,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:43:41,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 22:43:41,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:43:41,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:43:41,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:43:41,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:43:42,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:43:42,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:43:42,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:43:42,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:43:42,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:43:42,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:43:42,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:43:42,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 22:43:42,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:43:42,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:43:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:43:42,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:43:42,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:43:42,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:43:42,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:43:42,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:43:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:43:43,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:43:43,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:43:43,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 22:43:43,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:43:43,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:43:43,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:43:43,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:43:43,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:43:43,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:43:43,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:43:43,733] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:43:43,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:43:43,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:43:43,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:43:43,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 22:43:44,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:43:44,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:43:44,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:43:44,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:43:44,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:43:44,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:43:44,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:43:44,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:43:44,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:43:44,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:43:44,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:43:44,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 22:43:44,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:43:44,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:43:44,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:43:44,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_29-model_00-model_states.pt... 0: [2022-11-26 22:43:45,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_29-model_00-model_states.pt. 0: [2022-11-26 22:43:45,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:43:45,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:43:45,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_31-model_00-model_states.pt... 0: [2022-11-26 22:43:45,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_31-model_00-model_states.pt. 0: [2022-11-26 22:43:45,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_32-model_00-model_states.pt... 0: [2022-11-26 22:43:45,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_32-model_00-model_states.pt. 0: [2022-11-26 22:43:45,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_33-model_00-model_states.pt... 
0: [2022-11-26 22:43:45,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_33-model_00-model_states.pt. 0: [2022-11-26 22:43:45,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_34-model_00-model_states.pt... 0: [2022-11-26 22:43:45,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_34-model_00-model_states.pt. 0: [2022-11-26 22:43:45,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/layer_36-model_00-model_states.pt... 0: [2022-11-26 22:43:45,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/layer_36-model_00-model_states.pt. 0: [2022-11-26 22:43:45,724] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step31000/mp_rank_00_model_states.pt 0: [2022-11-26 22:43:45,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:43:45,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/mp_rank_00_model_states.pt. 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:45,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:46,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
3: [2022-11-26 22:43:46,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 22:43:46,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
6: [2022-11-26 22:43:46,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
3: [2022-11-26 22:43:46,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 22:43:46,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 22:43:46,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 22:43:46,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
5: [2022-11-26 22:43:46,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 22:43:46,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:46,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:46,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
3: [2022-11-26 22:43:46,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 22:43:46,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:46,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:46,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 22:43:46,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:46,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:46,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
0: [2022-11-26 22:43:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:46,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 22:43:46,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
1: [2022-11-26 22:43:46,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:46,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:46,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
4: [2022-11-26 22:43:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 22:43:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
2: [2022-11-26 22:43:46,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 22:43:46,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
2: [2022-11-26 22:43:46,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:46,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:46,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 22:43:46,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:46,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 22:43:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: successfully saved checkpoint at iteration 31000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5642.62 7: iteration 31010/ 44073 | consumed samples: 15877120 | consumed tokens: 32516341760 | elapsed time per iteration (s): 4.83 | learning rate: 5.696E-05 | global batch size: 512 | lm loss: 1.945226E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.956 | TFLOPs: 49.38 | 7: iteration 31020/ 44073 | consumed samples: 15882240 | consumed tokens: 32526827520 | elapsed time per iteration (s): 4.32 | learning rate: 5.691E-05 | global batch size: 512 | lm loss: 1.947918E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.634 | TFLOPs: 55.29 | 7: iteration 31030/ 44073 | consumed samples: 15887360 | consumed tokens: 32537313280 | elapsed time per iteration (s): 4.15 | learning rate: 5.686E-05 | global batch size: 512 | lm loss: 1.951649E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.465 | TFLOPs: 57.54 | 7: iteration 31040/ 44073 | consumed samples: 15892480 | consumed tokens: 32547799040 | 
elapsed time per iteration (s): 4.34 | learning rate: 5.680E-05 | global batch size: 512 | lm loss: 1.975941E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.906 | TFLOPs: 54.95 | 7: iteration 31050/ 44073 | consumed samples: 15897600 | consumed tokens: 32558284800 | elapsed time per iteration (s): 4.18 | learning rate: 5.675E-05 | global batch size: 512 | lm loss: 1.943099E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.585 | TFLOPs: 57.13 | 7: iteration 31060/ 44073 | consumed samples: 15902720 | consumed tokens: 32568770560 | elapsed time per iteration (s): 4.30 | learning rate: 5.670E-05 | global batch size: 512 | lm loss: 1.953470E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.065 | TFLOPs: 55.49 | 7: iteration 31070/ 44073 | consumed samples: 15907840 | consumed tokens: 32579256320 | elapsed time per iteration (s): 4.16 | learning rate: 5.665E-05 | global batch size: 512 | lm loss: 1.962148E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.998 | TFLOPs: 57.32 | 7: iteration 31080/ 44073 | consumed samples: 15912960 | consumed tokens: 32589742080 | elapsed time per iteration (s): 4.36 | learning rate: 5.660E-05 | global batch size: 512 | lm loss: 1.947871E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.310 | TFLOPs: 54.67 | 7: iteration 31090/ 44073 | consumed samples: 15918080 | consumed tokens: 32600227840 | elapsed time per iteration (s): 4.15 | learning rate: 5.654E-05 | global batch size: 512 | lm loss: 1.974279E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.370 | TFLOPs: 
57.50 | 7: iteration 31100/ 44073 | consumed samples: 15923200 | consumed tokens: 32610713600 | elapsed time per iteration (s): 4.14 | learning rate: 5.649E-05 | global batch size: 512 | lm loss: 1.959518E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.737 | TFLOPs: 57.67 | 7: iteration 31110/ 44073 | consumed samples: 15928320 | consumed tokens: 32621199360 | elapsed time per iteration (s): 4.15 | learning rate: 5.644E-05 | global batch size: 512 | lm loss: 1.970712E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.285 | TFLOPs: 57.46 | 7: iteration 31120/ 44073 | consumed samples: 15933440 | consumed tokens: 32631685120 | elapsed time per iteration (s): 4.18 | learning rate: 5.639E-05 | global batch size: 512 | lm loss: 1.963643E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.460 | TFLOPs: 57.07 | 7: iteration 31130/ 44073 | consumed samples: 15938560 | consumed tokens: 32642170880 | elapsed time per iteration (s): 4.20 | learning rate: 5.634E-05 | global batch size: 512 | lm loss: 1.951831E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.768 | TFLOPs: 56.75 | 7: iteration 31140/ 44073 | consumed samples: 15943680 | consumed tokens: 32652656640 | elapsed time per iteration (s): 4.19 | learning rate: 5.628E-05 | global batch size: 512 | lm loss: 1.954274E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.207 | TFLOPs: 56.95 | 7: iteration 31150/ 44073 | consumed samples: 15948800 | consumed tokens: 32663142400 | elapsed time per iteration (s): 4.18 | learning rate: 5.623E-05 | global batch size: 512 | lm loss: 1.958652E+00 | grad norm: 0.126 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.498 | TFLOPs: 57.09 | 7: iteration 31160/ 44073 | consumed samples: 15953920 | consumed tokens: 32673628160 | elapsed time per iteration (s): 4.21 | learning rate: 5.618E-05 | global batch size: 512 | lm loss: 1.967073E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.710 | TFLOPs: 56.72 | 7: iteration 31170/ 44073 | consumed samples: 15959040 | consumed tokens: 32684113920 | elapsed time per iteration (s): 4.18 | learning rate: 5.613E-05 | global batch size: 512 | lm loss: 1.954029E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.440 | TFLOPs: 57.06 | 7: iteration 31180/ 44073 | consumed samples: 15964160 | consumed tokens: 32694599680 | elapsed time per iteration (s): 4.17 | learning rate: 5.608E-05 | global batch size: 512 | lm loss: 1.968900E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.818 | TFLOPs: 57.24 | 7: iteration 31190/ 44073 | consumed samples: 15969280 | consumed tokens: 32705085440 | elapsed time per iteration (s): 4.19 | learning rate: 5.602E-05 | global batch size: 512 | lm loss: 1.963737E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.123 | TFLOPs: 56.92 | 7: iteration 31200/ 44073 | consumed samples: 15974400 | consumed tokens: 32715571200 | elapsed time per iteration (s): 4.15 | learning rate: 5.597E-05 | global batch size: 512 | lm loss: 1.949740E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.325 | TFLOPs: 57.48 | 7: iteration 31210/ 44073 | consumed samples: 15979520 | consumed tokens: 32726056960 | elapsed time per iteration (s): 4.15 | learning rate: 5.592E-05 
| global batch size: 512 | lm loss: 1.960901E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.228 | TFLOPs: 57.43 | 7: iteration 31220/ 44073 | consumed samples: 15984640 | consumed tokens: 32736542720 | elapsed time per iteration (s): 4.16 | learning rate: 5.587E-05 | global batch size: 512 | lm loss: 1.958646E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.046 | TFLOPs: 57.35 | 7: iteration 31230/ 44073 | consumed samples: 15989760 | consumed tokens: 32747028480 | elapsed time per iteration (s): 4.60 | learning rate: 5.582E-05 | global batch size: 512 | lm loss: 1.939856E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 111.322 | TFLOPs: 51.88 | 7: iteration 31240/ 44073 | consumed samples: 15994880 | consumed tokens: 32757514240 | elapsed time per iteration (s): 4.20 | learning rate: 5.577E-05 | global batch size: 512 | lm loss: 1.942752E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.958 | TFLOPs: 56.84 | 7: iteration 31250/ 44073 | consumed samples: 16000000 | consumed tokens: 32768000000 | elapsed time per iteration (s): 4.14 | learning rate: 5.571E-05 | global batch size: 512 | lm loss: 1.970014E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.664 | TFLOPs: 57.63 | 7: iteration 31260/ 44073 | consumed samples: 16005120 | consumed tokens: 32778485760 | elapsed time per iteration (s): 4.18 | learning rate: 5.566E-05 | global batch size: 512 | lm loss: 1.947098E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.547 | TFLOPs: 57.11 | 7: iteration 31270/ 44073 | consumed samples: 16010240 | 
consumed tokens: 32788971520 | elapsed time per iteration (s): 4.15 | learning rate: 5.561E-05 | global batch size: 512 | lm loss: 1.978720E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.238 | TFLOPs: 57.44 | 7: iteration 31280/ 44073 | consumed samples: 16015360 | consumed tokens: 32799457280 | elapsed time per iteration (s): 4.20 | learning rate: 5.556E-05 | global batch size: 512 | lm loss: 1.966202E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.949 | TFLOPs: 56.83 | 7: iteration 31290/ 44073 | consumed samples: 16020480 | consumed tokens: 32809943040 | elapsed time per iteration (s): 4.17 | learning rate: 5.551E-05 | global batch size: 512 | lm loss: 1.954725E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.810 | TFLOPs: 57.24 | 7: iteration 31300/ 44073 | consumed samples: 16025600 | consumed tokens: 32820428800 | elapsed time per iteration (s): 4.17 | learning rate: 5.546E-05 | global batch size: 512 | lm loss: 1.981286E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.820 | TFLOPs: 57.24 | 7: iteration 31310/ 44073 | consumed samples: 16030720 | consumed tokens: 32830914560 | elapsed time per iteration (s): 4.16 | learning rate: 5.540E-05 | global batch size: 512 | lm loss: 1.964875E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.942 | TFLOPs: 57.30 | 7: iteration 31320/ 44073 | consumed samples: 16035840 | consumed tokens: 32841400320 | elapsed time per iteration (s): 4.23 | learning rate: 5.535E-05 | global batch size: 512 | lm loss: 1.962687E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 120.927 | TFLOPs: 56.36 | 7: iteration 31330/ 44073 | consumed samples: 16040960 | consumed tokens: 32851886080 | elapsed time per iteration (s): 4.14 | learning rate: 5.530E-05 | global batch size: 512 | lm loss: 1.938685E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.768 | TFLOPs: 57.68 | 7: iteration 31340/ 44073 | consumed samples: 16046080 | consumed tokens: 32862371840 | elapsed time per iteration (s): 4.16 | learning rate: 5.525E-05 | global batch size: 512 | lm loss: 1.952987E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.214 | TFLOPs: 57.42 | 7: iteration 31350/ 44073 | consumed samples: 16051200 | consumed tokens: 32872857600 | elapsed time per iteration (s): 4.15 | learning rate: 5.520E-05 | global batch size: 512 | lm loss: 1.968034E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.516 | TFLOPs: 57.56 | 7: iteration 31360/ 44073 | consumed samples: 16056320 | consumed tokens: 32883343360 | elapsed time per iteration (s): 4.18 | learning rate: 5.515E-05 | global batch size: 512 | lm loss: 1.952507E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.461 | TFLOPs: 57.07 | 7: iteration 31370/ 44073 | consumed samples: 16061440 | consumed tokens: 32893829120 | elapsed time per iteration (s): 4.17 | learning rate: 5.510E-05 | global batch size: 512 | lm loss: 1.944283E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.683 | TFLOPs: 57.18 | 7: iteration 31380/ 44073 | consumed samples: 16066560 | consumed tokens: 32904314880 | elapsed time per iteration (s): 4.15 | learning rate: 5.504E-05 | global batch size: 512 | lm loss: 1.952876E+00 | grad norm: 
0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.315 | TFLOPs: 57.47 | 7: iteration 31390/ 44073 | consumed samples: 16071680 | consumed tokens: 32914800640 | elapsed time per iteration (s): 4.20 | learning rate: 5.499E-05 | global batch size: 512 | lm loss: 1.944931E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.018 | TFLOPs: 56.87 | 7: iteration 31400/ 44073 | consumed samples: 16076800 | consumed tokens: 32925286400 | elapsed time per iteration (s): 4.15 | learning rate: 5.494E-05 | global batch size: 512 | lm loss: 1.954345E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.409 | TFLOPs: 57.51 | 7: iteration 31410/ 44073 | consumed samples: 16081920 | consumed tokens: 32935772160 | elapsed time per iteration (s): 4.18 | learning rate: 5.489E-05 | global batch size: 512 | lm loss: 1.985744E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.383 | TFLOPs: 57.04 | 7: iteration 31420/ 44073 | consumed samples: 16087040 | consumed tokens: 32946257920 | elapsed time per iteration (s): 4.20 | learning rate: 5.484E-05 | global batch size: 512 | lm loss: 1.973307E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.884 | TFLOPs: 56.80 | 7: iteration 31430/ 44073 | consumed samples: 16092160 | consumed tokens: 32956743680 | elapsed time per iteration (s): 4.16 | learning rate: 5.479E-05 | global batch size: 512 | lm loss: 1.937037E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.117 | TFLOPs: 57.38 | 7: iteration 31440/ 44073 | consumed samples: 16097280 | consumed tokens: 32967229440 | elapsed time per iteration (s): 
4.16 | learning rate: 5.474E-05 | global batch size: 512 | lm loss: 1.955630E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.998 | TFLOPs: 57.32 | 7: iteration 31450/ 44073 | consumed samples: 16102400 | consumed tokens: 32977715200 | elapsed time per iteration (s): 4.15 | learning rate: 5.469E-05 | global batch size: 512 | lm loss: 1.962283E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.505 | TFLOPs: 57.56 | 7: iteration 31460/ 44073 | consumed samples: 16107520 | consumed tokens: 32988200960 | elapsed time per iteration (s): 4.14 | learning rate: 5.463E-05 | global batch size: 512 | lm loss: 1.960085E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.608 | TFLOPs: 57.61 | 7: iteration 31470/ 44073 | consumed samples: 16112640 | consumed tokens: 32998686720 | elapsed time per iteration (s): 4.14 | learning rate: 5.458E-05 | global batch size: 512 | lm loss: 1.932141E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.579 | TFLOPs: 57.59 | 7: iteration 31480/ 44073 | consumed samples: 16117760 | consumed tokens: 33009172480 | elapsed time per iteration (s): 4.14 | learning rate: 5.453E-05 | global batch size: 512 | lm loss: 1.958889E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.798 | TFLOPs: 57.70 | 7: iteration 31490/ 44073 | consumed samples: 16122880 | consumed tokens: 33019658240 | elapsed time per iteration (s): 4.19 | learning rate: 5.448E-05 | global batch size: 512 | lm loss: 1.954976E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.330 | TFLOPs: 57.01 | 7: iteration 31500/ 44073 
| consumed samples: 16128000 | consumed tokens: 33030144000 | elapsed time per iteration (s): 4.18 | learning rate: 5.443E-05 | global batch size: 512 | lm loss: 1.954887E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.412 | TFLOPs: 57.05 | 7: iteration 31510/ 44073 | consumed samples: 16133120 | consumed tokens: 33040629760 | elapsed time per iteration (s): 4.21 | learning rate: 5.438E-05 | global batch size: 512 | lm loss: 1.930448E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.584 | TFLOPs: 56.66 | 7: iteration 31520/ 44073 | consumed samples: 16138240 | consumed tokens: 33051115520 | elapsed time per iteration (s): 4.18 | learning rate: 5.433E-05 | global batch size: 512 | lm loss: 1.978760E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.553 | TFLOPs: 57.12 | 7: iteration 31530/ 44073 | consumed samples: 16143360 | consumed tokens: 33061601280 | elapsed time per iteration (s): 4.17 | learning rate: 5.428E-05 | global batch size: 512 | lm loss: 1.973522E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.868 | TFLOPs: 57.26 | 7: iteration 31540/ 44073 | consumed samples: 16148480 | consumed tokens: 33072087040 | elapsed time per iteration (s): 4.25 | learning rate: 5.423E-05 | global batch size: 512 | lm loss: 1.957811E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.412 | TFLOPs: 56.12 | 7: iteration 31550/ 44073 | consumed samples: 16153600 | consumed tokens: 33082572800 | elapsed time per iteration (s): 4.19 | learning rate: 5.418E-05 | global batch size: 512 | lm loss: 1.960138E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 122.284 | TFLOPs: 56.99 | 7: iteration 31560/ 44073 | consumed samples: 16158720 | consumed tokens: 33093058560 | elapsed time per iteration (s): 4.17 | learning rate: 5.412E-05 | global batch size: 512 | lm loss: 1.935040E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.788 | TFLOPs: 57.23 | 7: iteration 31570/ 44073 | consumed samples: 16163840 | consumed tokens: 33103544320 | elapsed time per iteration (s): 4.25 | learning rate: 5.407E-05 | global batch size: 512 | lm loss: 1.956532E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.367 | TFLOPs: 56.10 | 7: iteration 31580/ 44073 | consumed samples: 16168960 | consumed tokens: 33114030080 | elapsed time per iteration (s): 4.24 | learning rate: 5.402E-05 | global batch size: 512 | lm loss: 1.950014E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.804 | TFLOPs: 56.30 | 7: iteration 31590/ 44073 | consumed samples: 16174080 | consumed tokens: 33124515840 | elapsed time per iteration (s): 4.14 | learning rate: 5.397E-05 | global batch size: 512 | lm loss: 1.946185E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.701 | TFLOPs: 57.65 | 7: iteration 31600/ 44073 | consumed samples: 16179200 | consumed tokens: 33135001600 | elapsed time per iteration (s): 4.20 | learning rate: 5.392E-05 | global batch size: 512 | lm loss: 1.958393E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.014 | TFLOPs: 56.86 | 7: iteration 31610/ 44073 | consumed samples: 16184320 | consumed tokens: 33145487360 | elapsed time per iteration (s): 4.14 | learning rate: 5.387E-05 | global batch size: 512 | lm 
loss: 1.954568E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.542 | TFLOPs: 57.58 | 7: iteration 31620/ 44073 | consumed samples: 16189440 | consumed tokens: 33155973120 | elapsed time per iteration (s): 4.20 | learning rate: 5.382E-05 | global batch size: 512 | lm loss: 1.954945E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.007 | TFLOPs: 56.86 | 7: iteration 31630/ 44073 | consumed samples: 16194560 | consumed tokens: 33166458880 | elapsed time per iteration (s): 4.14 | learning rate: 5.377E-05 | global batch size: 512 | lm loss: 1.928561E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.682 | TFLOPs: 57.64 | 7: iteration 31640/ 44073 | consumed samples: 16199680 | consumed tokens: 33176944640 | elapsed time per iteration (s): 4.14 | learning rate: 5.372E-05 | global batch size: 512 | lm loss: 1.943369E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.618 | TFLOPs: 57.61 | 7: iteration 31650/ 44073 | consumed samples: 16204800 | consumed tokens: 33187430400 | elapsed time per iteration (s): 4.16 | learning rate: 5.367E-05 | global batch size: 512 | lm loss: 1.944312E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.963 | TFLOPs: 57.31 | 7: iteration 31660/ 44073 | consumed samples: 16209920 | consumed tokens: 33197916160 | elapsed time per iteration (s): 4.32 | learning rate: 5.362E-05 | global batch size: 512 | lm loss: 1.957428E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.509 | TFLOPs: 55.23 | 7: iteration 31670/ 44073 | consumed samples: 16215040 | consumed tokens: 33208401920 | 
elapsed time per iteration (s): 4.19 | learning rate: 5.357E-05 | global batch size: 512 | lm loss: 1.946002E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.294 | TFLOPs: 57.00 | 7: iteration 31680/ 44073 | consumed samples: 16220160 | consumed tokens: 33218887680 | elapsed time per iteration (s): 4.18 | learning rate: 5.352E-05 | global batch size: 512 | lm loss: 1.961722E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.514 | TFLOPs: 57.10 | 7: iteration 31690/ 44073 | consumed samples: 16225280 | consumed tokens: 33229373440 | elapsed time per iteration (s): 4.19 | learning rate: 5.347E-05 | global batch size: 512 | lm loss: 1.950257E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.111 | TFLOPs: 56.91 | 7: iteration 31700/ 44073 | consumed samples: 16230400 | consumed tokens: 33239859200 | elapsed time per iteration (s): 4.21 | learning rate: 5.342E-05 | global batch size: 512 | lm loss: 1.958797E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.676 | TFLOPs: 56.71 | 7: iteration 31710/ 44073 | consumed samples: 16235520 | consumed tokens: 33250344960 | elapsed time per iteration (s): 4.22 | learning rate: 5.337E-05 | global batch size: 512 | lm loss: 1.956244E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.447 | TFLOPs: 56.60 | 7: iteration 31720/ 44073 | consumed samples: 16240640 | consumed tokens: 33260830720 | elapsed time per iteration (s): 4.13 | learning rate: 5.332E-05 | global batch size: 512 | lm loss: 1.964492E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.959 | TFLOPs: 
57.77 | 7: iteration 31730/ 44073 | consumed samples: 16245760 | consumed tokens: 33271316480 | elapsed time per iteration (s): 4.23 | learning rate: 5.327E-05 | global batch size: 512 | lm loss: 1.941195E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.179 | TFLOPs: 56.48 | 7: iteration 31740/ 44073 | consumed samples: 16250880 | consumed tokens: 33281802240 | elapsed time per iteration (s): 4.14 | learning rate: 5.322E-05 | global batch size: 512 | lm loss: 1.957235E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.791 | TFLOPs: 57.69 | 7: iteration 31750/ 44073 | consumed samples: 16256000 | consumed tokens: 33292288000 | elapsed time per iteration (s): 4.14 | learning rate: 5.316E-05 | global batch size: 512 | lm loss: 1.968036E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.566 | TFLOPs: 57.59 | 7: iteration 31760/ 44073 | consumed samples: 16261120 | consumed tokens: 33302773760 | elapsed time per iteration (s): 4.16 | learning rate: 5.311E-05 | global batch size: 512 | lm loss: 1.960993E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.168 | TFLOPs: 57.40 | 7: iteration 31770/ 44073 | consumed samples: 16266240 | consumed tokens: 33313259520 | elapsed time per iteration (s): 4.16 | learning rate: 5.306E-05 | global batch size: 512 | lm loss: 1.946672E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.045 | TFLOPs: 57.35 | 7: iteration 31780/ 44073 | consumed samples: 16271360 | consumed tokens: 33323745280 | elapsed time per iteration (s): 4.19 | learning rate: 5.301E-05 | global batch size: 512 | lm loss: 1.975004E+00 | grad norm: 0.129 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.055 | TFLOPs: 56.88 | 7: iteration 31790/ 44073 | consumed samples: 16276480 | consumed tokens: 33334231040 | elapsed time per iteration (s): 4.15 | learning rate: 5.296E-05 | global batch size: 512 | lm loss: 1.958238E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.389 | TFLOPs: 57.51 | 7: iteration 31800/ 44073 | consumed samples: 16281600 | consumed tokens: 33344716800 | elapsed time per iteration (s): 4.16 | learning rate: 5.291E-05 | global batch size: 512 | lm loss: 1.952151E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.055 | TFLOPs: 57.35 | 7: iteration 31810/ 44073 | consumed samples: 16286720 | consumed tokens: 33355202560 | elapsed time per iteration (s): 4.20 | learning rate: 5.286E-05 | global batch size: 512 | lm loss: 1.980512E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.784 | TFLOPs: 56.76 | 7: iteration 31820/ 44073 | consumed samples: 16291840 | consumed tokens: 33365688320 | elapsed time per iteration (s): 4.18 | learning rate: 5.281E-05 | global batch size: 512 | lm loss: 1.957608E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.617 | TFLOPs: 57.15 | 7: iteration 31830/ 44073 | consumed samples: 16296960 | consumed tokens: 33376174080 | elapsed time per iteration (s): 4.17 | learning rate: 5.276E-05 | global batch size: 512 | lm loss: 1.947719E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.923 | TFLOPs: 57.29 | 7: iteration 31840/ 44073 | consumed samples: 16302080 | consumed tokens: 33386659840 | elapsed time per iteration (s): 4.27 | learning rate: 5.271E-05 
| global batch size: 512 | lm loss: 1.943908E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.020 | TFLOPs: 55.94 | 7: iteration 31850/ 44073 | consumed samples: 16307200 | consumed tokens: 33397145600 | elapsed time per iteration (s): 4.21 | learning rate: 5.266E-05 | global batch size: 512 | lm loss: 1.942066E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.715 | TFLOPs: 56.73 | 7: iteration 31860/ 44073 | consumed samples: 16312320 | consumed tokens: 33407631360 | elapsed time per iteration (s): 4.15 | learning rate: 5.261E-05 | global batch size: 512 | lm loss: 1.959184E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.251 | TFLOPs: 57.44 | 7: iteration 31870/ 44073 | consumed samples: 16317440 | consumed tokens: 33418117120 | elapsed time per iteration (s): 4.19 | learning rate: 5.256E-05 | global batch size: 512 | lm loss: 1.948231E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.186 | TFLOPs: 56.95 | 7: iteration 31880/ 44073 | consumed samples: 16322560 | consumed tokens: 33428602880 | elapsed time per iteration (s): 4.16 | learning rate: 5.251E-05 | global batch size: 512 | lm loss: 1.964068E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.221 | TFLOPs: 57.43 | 7: iteration 31890/ 44073 | consumed samples: 16327680 | consumed tokens: 33439088640 | elapsed time per iteration (s): 4.18 | learning rate: 5.246E-05 | global batch size: 512 | lm loss: 1.952672E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.541 | TFLOPs: 57.11 | 7: iteration 31900/ 44073 | consumed samples: 16332800 | 
consumed tokens: 33449574400 | elapsed time per iteration (s): 4.19 | learning rate: 5.241E-05 | global batch size: 512 | lm loss: 1.970066E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.113 | TFLOPs: 56.91 | 7: iteration 31910/ 44073 | consumed samples: 16337920 | consumed tokens: 33460060160 | elapsed time per iteration (s): 4.18 | learning rate: 5.236E-05 | global batch size: 512 | lm loss: 1.944443E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.443 | TFLOPs: 57.06 | 7: iteration 31920/ 44073 | consumed samples: 16343040 | consumed tokens: 33470545920 | elapsed time per iteration (s): 4.19 | learning rate: 5.231E-05 | global batch size: 512 | lm loss: 1.966290E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.136 | TFLOPs: 56.92 | 7: iteration 31930/ 44073 | consumed samples: 16348160 | consumed tokens: 33481031680 | elapsed time per iteration (s): 4.16 | learning rate: 5.227E-05 | global batch size: 512 | lm loss: 1.953181E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.091 | TFLOPs: 57.37 | 7: iteration 31940/ 44073 | consumed samples: 16353280 | consumed tokens: 33491517440 | elapsed time per iteration (s): 4.21 | learning rate: 5.222E-05 | global batch size: 512 | lm loss: 1.958445E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.504 | TFLOPs: 56.63 | 7: iteration 31950/ 44073 | consumed samples: 16358400 | consumed tokens: 33502003200 | elapsed time per iteration (s): 4.40 | learning rate: 5.217E-05 | global batch size: 512 | lm loss: 1.951255E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 116.363 | TFLOPs: 54.23 | 7: iteration 31960/ 44073 | consumed samples: 16363520 | consumed tokens: 33512488960 | elapsed time per iteration (s): 4.20 | learning rate: 5.212E-05 | global batch size: 512 | lm loss: 1.956729E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.850 | TFLOPs: 56.79 | 7: iteration 31970/ 44073 | consumed samples: 16368640 | consumed tokens: 33522974720 | elapsed time per iteration (s): 4.52 | learning rate: 5.207E-05 | global batch size: 512 | lm loss: 1.941509E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 113.260 | TFLOPs: 52.78 | 7: iteration 31980/ 44073 | consumed samples: 16373760 | consumed tokens: 33533460480 | elapsed time per iteration (s): 4.17 | learning rate: 5.202E-05 | global batch size: 512 | lm loss: 1.955109E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.866 | TFLOPs: 57.26 | 7: iteration 31990/ 44073 | consumed samples: 16378880 | consumed tokens: 33543946240 | elapsed time per iteration (s): 4.16 | learning rate: 5.197E-05 | global batch size: 512 | lm loss: 1.937933E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.950 | TFLOPs: 57.30 | 0: [2022-11-26 23:53:39,895] [INFO] [logging.py:68:log_dist] [Rank 0] step=32000, skipped=0, lr=[5.191798931443464e-05, 5.191798931443464e-05, 5.191798931443464e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 32000/ 44073 | consumed samples: 16384000 | consumed tokens: 33554432000 | elapsed time per iteration (s): 4.17 | learning rate: 5.192E-05 | global batch size: 512 | lm loss: 1.962654E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.803 | TFLOPs: 57.23 | 0: 
steps: 32000 loss: 1.9881 iter time (s): 4.192 samples/sec: 122.149 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 32000 | lm loss value: 1.922786E+00 | lm loss PPL: 6.839990E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 32000 to checkpoints_2b2 0: [2022-11-26 23:53:41,236] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step32000 is begin to save! 0: [2022-11-26 23:53:41,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:53:41,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:53:41,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:53:41,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:53:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:53:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:53:41,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:53:41,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:53:41,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 23:53:42,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:53:42,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:53:42,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:53:42,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:53:42,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:53:42,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:53:42,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:53:42,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:53:42,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:53:42,614] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:53:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:53:42,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 23:53:42,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:53:42,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:53:43,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:53:43,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:53:43,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:53:43,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:53:43,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:53:43,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:53:43,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:53:43,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:53:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:53:43,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 23:53:43,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:53:43,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:53:43,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:53:43,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:53:43,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:53:43,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:53:44,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:53:44,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:53:44,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:53:44,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:53:44,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:53:44,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 23:53:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:53:44,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:53:44,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:53:44,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:53:44,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:53:44,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:53:44,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:53:44,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:53:45,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:53:45,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_29-model_00-model_states.pt... 0: [2022-11-26 23:53:45,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_29-model_00-model_states.pt. 0: [2022-11-26 23:53:45,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 23:53:45,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_30-model_00-model_states.pt. 0: [2022-11-26 23:53:45,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_31-model_00-model_states.pt... 0: [2022-11-26 23:53:45,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_31-model_00-model_states.pt. 0: [2022-11-26 23:53:45,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_32-model_00-model_states.pt... 0: [2022-11-26 23:53:45,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_32-model_00-model_states.pt. 0: [2022-11-26 23:53:45,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_33-model_00-model_states.pt... 0: [2022-11-26 23:53:45,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_33-model_00-model_states.pt. 0: [2022-11-26 23:53:45,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_34-model_00-model_states.pt... 0: [2022-11-26 23:53:45,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_34-model_00-model_states.pt. 0: [2022-11-26 23:53:45,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/layer_36-model_00-model_states.pt... 0: [2022-11-26 23:53:45,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/layer_36-model_00-model_states.pt. 
0: [2022-11-26 23:53:45,909] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step32000/mp_rank_00_model_states.pt 0: [2022-11-26 23:53:45,909] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:53:45,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 23:53:45,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:46,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:53:46,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:46,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:46,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:46,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:53:46,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:53:46,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:46,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:46,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 23:53:46,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:53:46,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:46,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
3: [2022-11-26 23:53:46,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:46,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 23:53:46,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:53:46,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:46,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:53:46,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:46,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 23:53:46,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:53:46,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:46,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 23:53:46,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
5: [2022-11-26 23:53:46,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:46,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
5: [2022-11-26 23:53:46,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:46,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 23:53:46,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 23:53:47,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:47,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:53:47,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:47,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 23:53:47,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: successfully saved checkpoint at iteration 32000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 5957.83 7: iteration 32010/ 44073 | consumed samples: 16389120 | consumed tokens: 33564917760 | elapsed time per iteration (s): 5.04 | learning rate: 5.187E-05 | global batch size: 512 | lm loss: 1.942101E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 101.497 | TFLOPs: 47.30 | 7: iteration 32020/ 44073 | consumed samples: 16394240 | consumed tokens: 33575403520 | elapsed time per iteration (s): 4.20 | learning rate: 5.182E-05 | global batch size: 512 | lm loss: 1.946298E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.801 | TFLOPs: 56.77 | 7: iteration 32030/ 44073 | consumed samples: 16399360 | consumed tokens: 33585889280 | elapsed time per iteration (s): 4.36 | learning rate: 5.177E-05 | global batch size: 512 | lm loss: 1.970271E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.332 | TFLOPs: 54.68 | 7: iteration 32040/ 44073 | consumed samples: 16404480 | consumed tokens: 33596375040 | elapsed time per iteration (s): 4.16 | learning rate: 5.172E-05 | global batch size: 512 | lm loss: 1.966453E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.978 | TFLOPs: 57.31 | 7: iteration 32050/ 44073 | consumed samples: 16409600 | consumed tokens: 33606860800 | elapsed time per iteration (s): 4.18 | learning rate: 5.167E-05 | global batch size: 512 | lm loss: 1.934859E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.627 | TFLOPs: 57.15 | 7: iteration 32060/ 44073 | consumed samples: 16414720 | consumed tokens: 33617346560 | elapsed time per iteration (s): 4.17 | learning rate: 
5.162E-05 | global batch size: 512 | lm loss: 1.950790E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.839 | TFLOPs: 57.25 | 7: iteration 32070/ 44073 | consumed samples: 16419840 | consumed tokens: 33627832320 | elapsed time per iteration (s): 4.18 | learning rate: 5.157E-05 | global batch size: 512 | lm loss: 1.944144E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.612 | TFLOPs: 57.14 | 7: iteration 32080/ 44073 | consumed samples: 16424960 | consumed tokens: 33638318080 | elapsed time per iteration (s): 4.18 | learning rate: 5.152E-05 | global batch size: 512 | lm loss: 1.951678E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.514 | TFLOPs: 57.10 | 7: iteration 32090/ 44073 | consumed samples: 16430080 | consumed tokens: 33648803840 | elapsed time per iteration (s): 6.47 | learning rate: 5.147E-05 | global batch size: 512 | lm loss: 1.952485E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 79.079 | TFLOPs: 36.85 | 7: iteration 32100/ 44073 | consumed samples: 16435200 | consumed tokens: 33659289600 | elapsed time per iteration (s): 4.16 | learning rate: 5.142E-05 | global batch size: 512 | lm loss: 1.951492E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.212 | TFLOPs: 57.42 | 7: iteration 32110/ 44073 | consumed samples: 16440320 | consumed tokens: 33669775360 | elapsed time per iteration (s): 4.20 | learning rate: 5.138E-05 | global batch size: 512 | lm loss: 1.973287E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.796 | TFLOPs: 56.76 | 7: iteration 32120/ 44073 | consumed samples: 
16445440 | consumed tokens: 33680261120 | elapsed time per iteration (s): 4.18 | learning rate: 5.133E-05 | global batch size: 512 | lm loss: 1.947427E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.604 | TFLOPs: 57.14 | 7: iteration 32130/ 44073 | consumed samples: 16450560 | consumed tokens: 33690746880 | elapsed time per iteration (s): 4.15 | learning rate: 5.128E-05 | global batch size: 512 | lm loss: 1.936645E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.474 | TFLOPs: 57.55 | 7: iteration 32140/ 44073 | consumed samples: 16455680 | consumed tokens: 33701232640 | elapsed time per iteration (s): 4.17 | learning rate: 5.123E-05 | global batch size: 512 | lm loss: 1.963186E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.667 | TFLOPs: 57.17 | 7: iteration 32150/ 44073 | consumed samples: 16460800 | consumed tokens: 33711718400 | elapsed time per iteration (s): 4.17 | learning rate: 5.118E-05 | global batch size: 512 | lm loss: 1.958639E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.811 | TFLOPs: 57.24 | 7: iteration 32160/ 44073 | consumed samples: 16465920 | consumed tokens: 33722204160 | elapsed time per iteration (s): 4.16 | learning rate: 5.113E-05 | global batch size: 512 | lm loss: 1.963330E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.941 | TFLOPs: 57.30 | 7: iteration 32170/ 44073 | consumed samples: 16471040 | consumed tokens: 33732689920 | elapsed time per iteration (s): 4.14 | learning rate: 5.108E-05 | global batch size: 512 | lm loss: 1.958721E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 123.577 | TFLOPs: 57.59 | 7: iteration 32180/ 44073 | consumed samples: 16476160 | consumed tokens: 33743175680 | elapsed time per iteration (s): 4.14 | learning rate: 5.103E-05 | global batch size: 512 | lm loss: 1.958897E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.757 | TFLOPs: 57.68 | 7: iteration 32190/ 44073 | consumed samples: 16481280 | consumed tokens: 33753661440 | elapsed time per iteration (s): 4.23 | learning rate: 5.098E-05 | global batch size: 512 | lm loss: 1.952666E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.153 | TFLOPs: 56.46 | 7: iteration 32200/ 44073 | consumed samples: 16486400 | consumed tokens: 33764147200 | elapsed time per iteration (s): 4.17 | learning rate: 5.093E-05 | global batch size: 512 | lm loss: 1.966660E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.786 | TFLOPs: 57.22 | 7: iteration 32210/ 44073 | consumed samples: 16491520 | consumed tokens: 33774632960 | elapsed time per iteration (s): 4.16 | learning rate: 5.089E-05 | global batch size: 512 | lm loss: 1.954049E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.087 | TFLOPs: 57.36 | 7: iteration 32220/ 44073 | consumed samples: 16496640 | consumed tokens: 33785118720 | elapsed time per iteration (s): 4.17 | learning rate: 5.084E-05 | global batch size: 512 | lm loss: 1.942203E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.709 | TFLOPs: 57.19 | 7: iteration 32230/ 44073 | consumed samples: 16501760 | consumed tokens: 33795604480 | elapsed time per iteration (s): 4.16 | learning rate: 5.079E-05 | global batch size: 512 | lm loss: 1.940616E+00 | 
grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.143 | TFLOPs: 57.39 | 7: iteration 32240/ 44073 | consumed samples: 16506880 | consumed tokens: 33806090240 | elapsed time per iteration (s): 4.21 | learning rate: 5.074E-05 | global batch size: 512 | lm loss: 1.956090E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.621 | TFLOPs: 56.68 | 7: iteration 32250/ 44073 | consumed samples: 16512000 | consumed tokens: 33816576000 | elapsed time per iteration (s): 4.20 | learning rate: 5.069E-05 | global batch size: 512 | lm loss: 1.956470E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.887 | TFLOPs: 56.81 | 7: iteration 32260/ 44073 | consumed samples: 16517120 | consumed tokens: 33827061760 | elapsed time per iteration (s): 4.21 | learning rate: 5.064E-05 | global batch size: 512 | lm loss: 1.960373E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.723 | TFLOPs: 56.73 | 7: iteration 32270/ 44073 | consumed samples: 16522240 | consumed tokens: 33837547520 | elapsed time per iteration (s): 4.16 | learning rate: 5.059E-05 | global batch size: 512 | lm loss: 1.961127E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.132 | TFLOPs: 57.39 | 7: iteration 32280/ 44073 | consumed samples: 16527360 | consumed tokens: 33848033280 | elapsed time per iteration (s): 4.19 | learning rate: 5.054E-05 | global batch size: 512 | lm loss: 1.942024E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.106 | TFLOPs: 56.91 | 7: iteration 32290/ 44073 | consumed samples: 16532480 | consumed tokens: 33858519040 | elapsed time per 
iteration (s): 4.15 | learning rate: 5.050E-05 | global batch size: 512 | lm loss: 1.963255E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.252 | TFLOPs: 57.44 | 7: iteration 32300/ 44073 | consumed samples: 16537600 | consumed tokens: 33869004800 | elapsed time per iteration (s): 4.14 | learning rate: 5.045E-05 | global batch size: 512 | lm loss: 1.926156E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.761 | TFLOPs: 57.68 | 7: iteration 32310/ 44073 | consumed samples: 16542720 | consumed tokens: 33879490560 | elapsed time per iteration (s): 4.16 | learning rate: 5.040E-05 | global batch size: 512 | lm loss: 1.932732E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.058 | TFLOPs: 57.35 | 7: iteration 32320/ 44073 | consumed samples: 16547840 | consumed tokens: 33889976320 | elapsed time per iteration (s): 4.16 | learning rate: 5.035E-05 | global batch size: 512 | lm loss: 1.949865E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.170 | TFLOPs: 57.40 | 7: iteration 32330/ 44073 | consumed samples: 16552960 | consumed tokens: 33900462080 | elapsed time per iteration (s): 4.15 | learning rate: 5.030E-05 | global batch size: 512 | lm loss: 1.965299E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.360 | TFLOPs: 57.49 | 7: iteration 32340/ 44073 | consumed samples: 16558080 | consumed tokens: 33910947840 | elapsed time per iteration (s): 5.73 | learning rate: 5.025E-05 | global batch size: 512 | lm loss: 1.930464E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 89.364 | TFLOPs: 41.65 | 7: 
iteration 32350/ 44073 | consumed samples: 16563200 | consumed tokens: 33921433600 | elapsed time per iteration (s): 4.15 | learning rate: 5.020E-05 | global batch size: 512 | lm loss: 1.968347E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.425 | TFLOPs: 57.52 | 7: iteration 32360/ 44073 | consumed samples: 16568320 | consumed tokens: 33931919360 | elapsed time per iteration (s): 4.14 | learning rate: 5.016E-05 | global batch size: 512 | lm loss: 1.941400E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.702 | TFLOPs: 57.65 | 7: iteration 32370/ 44073 | consumed samples: 16573440 | consumed tokens: 33942405120 | elapsed time per iteration (s): 4.16 | learning rate: 5.011E-05 | global batch size: 512 | lm loss: 1.940801E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.967 | TFLOPs: 57.31 | 7: iteration 32380/ 44073 | consumed samples: 16578560 | consumed tokens: 33952890880 | elapsed time per iteration (s): 4.24 | learning rate: 5.006E-05 | global batch size: 512 | lm loss: 1.935699E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.615 | TFLOPs: 56.21 | 7: iteration 32390/ 44073 | consumed samples: 16583680 | consumed tokens: 33963376640 | elapsed time per iteration (s): 4.20 | learning rate: 5.001E-05 | global batch size: 512 | lm loss: 1.950172E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.880 | TFLOPs: 56.80 | 7: iteration 32400/ 44073 | consumed samples: 16588800 | consumed tokens: 33973862400 | elapsed time per iteration (s): 4.22 | learning rate: 4.996E-05 | global batch size: 512 | lm loss: 1.938225E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 121.251 | TFLOPs: 56.51 | 7: iteration 32410/ 44073 | consumed samples: 16593920 | consumed tokens: 33984348160 | elapsed time per iteration (s): 4.17 | learning rate: 4.991E-05 | global batch size: 512 | lm loss: 1.953032E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.772 | TFLOPs: 57.22 | 7: iteration 32420/ 44073 | consumed samples: 16599040 | consumed tokens: 33994833920 | elapsed time per iteration (s): 4.17 | learning rate: 4.987E-05 | global batch size: 512 | lm loss: 1.930448E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.738 | TFLOPs: 57.20 | 7: iteration 32430/ 44073 | consumed samples: 16604160 | consumed tokens: 34005319680 | elapsed time per iteration (s): 4.19 | learning rate: 4.982E-05 | global batch size: 512 | lm loss: 1.946955E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.233 | TFLOPs: 56.97 | 7: iteration 32440/ 44073 | consumed samples: 16609280 | consumed tokens: 34015805440 | elapsed time per iteration (s): 4.17 | learning rate: 4.977E-05 | global batch size: 512 | lm loss: 1.947313E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.746 | TFLOPs: 57.21 | 7: iteration 32450/ 44073 | consumed samples: 16614400 | consumed tokens: 34026291200 | elapsed time per iteration (s): 4.18 | learning rate: 4.972E-05 | global batch size: 512 | lm loss: 1.943987E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.472 | TFLOPs: 57.08 | 7: iteration 32460/ 44073 | consumed samples: 16619520 | consumed tokens: 34036776960 | elapsed time per iteration (s): 4.15 | learning rate: 4.967E-05 | global 
batch size: 512 | lm loss: 1.960321E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.305 | TFLOPs: 57.47 | 7: iteration 32470/ 44073 | consumed samples: 16624640 | consumed tokens: 34047262720 | elapsed time per iteration (s): 4.15 | learning rate: 4.963E-05 | global batch size: 512 | lm loss: 1.951261E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 32480/ 44073 | consumed samples: 16629760 | consumed tokens: 34057748480 | elapsed time per iteration (s): 4.19 | learning rate: 4.958E-05 | global batch size: 512 | lm loss: 1.944259E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.183 | TFLOPs: 56.94 | 7: iteration 32490/ 44073 | consumed samples: 16634880 | consumed tokens: 34068234240 | elapsed time per iteration (s): 4.14 | learning rate: 4.953E-05 | global batch size: 512 | lm loss: 1.948589E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.571 | TFLOPs: 57.59 | 7: iteration 32500/ 44073 | consumed samples: 16640000 | consumed tokens: 34078720000 | elapsed time per iteration (s): 4.14 | learning rate: 4.948E-05 | global batch size: 512 | lm loss: 1.949113E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.683 | TFLOPs: 57.64 | 7: iteration 32510/ 44073 | consumed samples: 16645120 | consumed tokens: 34089205760 | elapsed time per iteration (s): 4.19 | learning rate: 4.943E-05 | global batch size: 512 | lm loss: 1.945713E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.325 | TFLOPs: 57.01 | 7: iteration 32520/ 44073 | consumed samples: 16650240 | consumed 
tokens: 34099691520 | elapsed time per iteration (s): 4.33 | learning rate: 4.939E-05 | global batch size: 512 | lm loss: 1.945504E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.166 | TFLOPs: 55.07 | 7: iteration 32530/ 44073 | consumed samples: 16655360 | consumed tokens: 34110177280 | elapsed time per iteration (s): 4.16 | learning rate: 4.934E-05 | global batch size: 512 | lm loss: 1.964680E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.983 | TFLOPs: 57.32 | 7: iteration 32540/ 44073 | consumed samples: 16660480 | consumed tokens: 34120663040 | elapsed time per iteration (s): 4.14 | learning rate: 4.929E-05 | global batch size: 512 | lm loss: 1.960892E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.638 | TFLOPs: 57.62 | 7: iteration 32550/ 44073 | consumed samples: 16665600 | consumed tokens: 34131148800 | elapsed time per iteration (s): 4.17 | learning rate: 4.924E-05 | global batch size: 512 | lm loss: 1.952140E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.778 | TFLOPs: 57.22 | 7: iteration 32560/ 44073 | consumed samples: 16670720 | consumed tokens: 34141634560 | elapsed time per iteration (s): 4.17 | learning rate: 4.919E-05 | global batch size: 512 | lm loss: 1.917668E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.669 | TFLOPs: 57.17 | 7: iteration 32570/ 44073 | consumed samples: 16675840 | consumed tokens: 34152120320 | elapsed time per iteration (s): 4.16 | learning rate: 4.915E-05 | global batch size: 512 | lm loss: 1.972579E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.107 | TFLOPs: 57.37 | 7: iteration 32580/ 44073 | consumed samples: 16680960 | consumed tokens: 34162606080 | elapsed time per iteration (s): 4.36 | learning rate: 4.910E-05 | global batch size: 512 | lm loss: 1.957631E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.521 | TFLOPs: 54.77 | 7: iteration 32590/ 44073 | consumed samples: 16686080 | consumed tokens: 34173091840 | elapsed time per iteration (s): 4.19 | learning rate: 4.905E-05 | global batch size: 512 | lm loss: 1.957795E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.325 | TFLOPs: 57.01 | 7: iteration 32600/ 44073 | consumed samples: 16691200 | consumed tokens: 34183577600 | elapsed time per iteration (s): 4.17 | learning rate: 4.900E-05 | global batch size: 512 | lm loss: 1.969868E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.802 | TFLOPs: 57.23 | 7: iteration 32610/ 44073 | consumed samples: 16696320 | consumed tokens: 34194063360 | elapsed time per iteration (s): 4.18 | learning rate: 4.896E-05 | global batch size: 512 | lm loss: 1.942551E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.397 | TFLOPs: 57.04 | 7: iteration 32620/ 44073 | consumed samples: 16701440 | consumed tokens: 34204549120 | elapsed time per iteration (s): 4.15 | learning rate: 4.891E-05 | global batch size: 512 | lm loss: 1.950139E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.277 | TFLOPs: 57.45 | 7: iteration 32630/ 44073 | consumed samples: 16706560 | consumed tokens: 34215034880 | elapsed time per iteration (s): 4.15 | learning rate: 4.886E-05 | global batch size: 512 | lm loss: 1.968612E+00 | grad norm: 0.127 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.232 | TFLOPs: 57.43 | 7: iteration 32640/ 44073 | consumed samples: 16711680 | consumed tokens: 34225520640 | elapsed time per iteration (s): 4.18 | learning rate: 4.881E-05 | global batch size: 512 | lm loss: 1.959905E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.437 | TFLOPs: 57.06 | 7: iteration 32650/ 44073 | consumed samples: 16716800 | consumed tokens: 34236006400 | elapsed time per iteration (s): 4.25 | learning rate: 4.877E-05 | global batch size: 512 | lm loss: 1.963343E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.413 | TFLOPs: 56.12 | 7: iteration 32660/ 44073 | consumed samples: 16721920 | consumed tokens: 34246492160 | elapsed time per iteration (s): 4.19 | learning rate: 4.872E-05 | global batch size: 512 | lm loss: 1.953547E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.101 | TFLOPs: 56.91 | 7: iteration 32670/ 44073 | consumed samples: 16727040 | consumed tokens: 34256977920 | elapsed time per iteration (s): 4.15 | learning rate: 4.867E-05 | global batch size: 512 | lm loss: 1.942506E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.456 | TFLOPs: 57.54 | 7: iteration 32680/ 44073 | consumed samples: 16732160 | consumed tokens: 34267463680 | elapsed time per iteration (s): 4.18 | learning rate: 4.862E-05 | global batch size: 512 | lm loss: 1.939082E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.527 | TFLOPs: 57.10 | 7: iteration 32690/ 44073 | consumed samples: 16737280 | consumed tokens: 34277949440 | elapsed time per iteration (s): 4.17 
| learning rate: 4.858E-05 | global batch size: 512 | lm loss: 1.965629E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.688 | TFLOPs: 57.18 | 7: iteration 32700/ 44073 | consumed samples: 16742400 | consumed tokens: 34288435200 | elapsed time per iteration (s): 4.16 | learning rate: 4.853E-05 | global batch size: 512 | lm loss: 1.934778E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.136 | TFLOPs: 57.39 | 7: iteration 32710/ 44073 | consumed samples: 16747520 | consumed tokens: 34298920960 | elapsed time per iteration (s): 4.14 | learning rate: 4.848E-05 | global batch size: 512 | lm loss: 1.955817E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.562 | TFLOPs: 57.59 | 7: iteration 32720/ 44073 | consumed samples: 16752640 | consumed tokens: 34309406720 | elapsed time per iteration (s): 4.16 | learning rate: 4.843E-05 | global batch size: 512 | lm loss: 1.960301E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.984 | TFLOPs: 57.32 | 7: iteration 32730/ 44073 | consumed samples: 16757760 | consumed tokens: 34319892480 | elapsed time per iteration (s): 4.18 | learning rate: 4.839E-05 | global batch size: 512 | lm loss: 1.946865E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.533 | TFLOPs: 57.11 | 7: iteration 32740/ 44073 | consumed samples: 16762880 | consumed tokens: 34330378240 | elapsed time per iteration (s): 4.18 | learning rate: 4.834E-05 | global batch size: 512 | lm loss: 1.952098E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.535 | TFLOPs: 57.11 | 7: iteration 32750/ 44073 | 
consumed samples: 16768000 | consumed tokens: 34340864000 | elapsed time per iteration (s): 4.16 | learning rate: 4.829E-05 | global batch size: 512 | lm loss: 1.953789E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.022 | TFLOPs: 57.33 | 7: iteration 32760/ 44073 | consumed samples: 16773120 | consumed tokens: 34351349760 | elapsed time per iteration (s): 4.16 | learning rate: 4.824E-05 | global batch size: 512 | lm loss: 1.939144E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.141 | TFLOPs: 57.39 | 7: iteration 32770/ 44073 | consumed samples: 16778240 | consumed tokens: 34361835520 | elapsed time per iteration (s): 4.15 | learning rate: 4.820E-05 | global batch size: 512 | lm loss: 1.938121E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.418 | TFLOPs: 57.52 | 7: iteration 32780/ 44073 | consumed samples: 16783360 | consumed tokens: 34372321280 | elapsed time per iteration (s): 4.15 | learning rate: 4.815E-05 | global batch size: 512 | lm loss: 1.939067E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.438 | TFLOPs: 57.53 | 7: iteration 32790/ 44073 | consumed samples: 16788480 | consumed tokens: 34382807040 | elapsed time per iteration (s): 4.15 | learning rate: 4.810E-05 | global batch size: 512 | lm loss: 1.956247E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.316 | TFLOPs: 57.47 | 7: iteration 32800/ 44073 | consumed samples: 16793600 | consumed tokens: 34393292800 | elapsed time per iteration (s): 4.13 | learning rate: 4.806E-05 | global batch size: 512 | lm loss: 1.954855E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.974 | TFLOPs: 57.78 | 7: iteration 32810/ 44073 | consumed samples: 16798720 | consumed tokens: 34403778560 | elapsed time per iteration (s): 4.17 | learning rate: 4.801E-05 | global batch size: 512 | lm loss: 1.946652E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.878 | TFLOPs: 57.27 | 7: iteration 32820/ 44073 | consumed samples: 16803840 | consumed tokens: 34414264320 | elapsed time per iteration (s): 4.15 | learning rate: 4.796E-05 | global batch size: 512 | lm loss: 1.928932E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.257 | TFLOPs: 57.44 | 7: iteration 32830/ 44073 | consumed samples: 16808960 | consumed tokens: 34424750080 | elapsed time per iteration (s): 4.15 | learning rate: 4.792E-05 | global batch size: 512 | lm loss: 1.949027E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.357 | TFLOPs: 57.49 | 7: iteration 32840/ 44073 | consumed samples: 16814080 | consumed tokens: 34435235840 | elapsed time per iteration (s): 4.13 | learning rate: 4.787E-05 | global batch size: 512 | lm loss: 1.944868E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.892 | TFLOPs: 57.74 | 7: iteration 32850/ 44073 | consumed samples: 16819200 | consumed tokens: 34445721600 | elapsed time per iteration (s): 4.17 | learning rate: 4.782E-05 | global batch size: 512 | lm loss: 1.944268E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.679 | TFLOPs: 57.17 | 7: iteration 32860/ 44073 | consumed samples: 16824320 | consumed tokens: 34456207360 | elapsed time per iteration (s): 4.16 | learning rate: 4.778E-05 | global batch size: 512 | lm loss: 
1.930056E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.107 | TFLOPs: 57.37 | 7: iteration 32870/ 44073 | consumed samples: 16829440 | consumed tokens: 34466693120 | elapsed time per iteration (s): 4.16 | learning rate: 4.773E-05 | global batch size: 512 | lm loss: 1.955361E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.173 | TFLOPs: 57.41 | 7: iteration 32880/ 44073 | consumed samples: 16834560 | consumed tokens: 34477178880 | elapsed time per iteration (s): 4.17 | learning rate: 4.768E-05 | global batch size: 512 | lm loss: 1.964414E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.898 | TFLOPs: 57.28 | 7: iteration 32890/ 44073 | consumed samples: 16839680 | consumed tokens: 34487664640 | elapsed time per iteration (s): 4.35 | learning rate: 4.763E-05 | global batch size: 512 | lm loss: 1.937081E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.729 | TFLOPs: 54.87 | 7: iteration 32900/ 44073 | consumed samples: 16844800 | consumed tokens: 34498150400 | elapsed time per iteration (s): 4.16 | learning rate: 4.759E-05 | global batch size: 512 | lm loss: 1.942980E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.031 | TFLOPs: 57.34 | 7: iteration 32910/ 44073 | consumed samples: 16849920 | consumed tokens: 34508636160 | elapsed time per iteration (s): 4.19 | learning rate: 4.754E-05 | global batch size: 512 | lm loss: 1.957458E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.092 | TFLOPs: 56.90 | 7: iteration 32920/ 44073 | consumed samples: 16855040 | consumed tokens: 34519121920 | 
elapsed time per iteration (s): 4.21 | learning rate: 4.749E-05 | global batch size: 512 | lm loss: 1.939099E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.634 | TFLOPs: 56.69 | 7: iteration 32930/ 44073 | consumed samples: 16860160 | consumed tokens: 34529607680 | elapsed time per iteration (s): 4.16 | learning rate: 4.745E-05 | global batch size: 512 | lm loss: 1.942186E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.139 | TFLOPs: 57.39 | 7: iteration 32940/ 44073 | consumed samples: 16865280 | consumed tokens: 34540093440 | elapsed time per iteration (s): 4.34 | learning rate: 4.740E-05 | global batch size: 512 | lm loss: 1.960878E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.977 | TFLOPs: 54.98 | 7: iteration 32950/ 44073 | consumed samples: 16870400 | consumed tokens: 34550579200 | elapsed time per iteration (s): 4.33 | learning rate: 4.735E-05 | global batch size: 512 | lm loss: 1.944419E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.116 | TFLOPs: 55.05 | 7: iteration 32960/ 44073 | consumed samples: 16875520 | consumed tokens: 34561064960 | elapsed time per iteration (s): 4.14 | learning rate: 4.731E-05 | global batch size: 512 | lm loss: 1.947214E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.594 | TFLOPs: 57.60 | 7: iteration 32970/ 44073 | consumed samples: 16880640 | consumed tokens: 34571550720 | elapsed time per iteration (s): 4.15 | learning rate: 4.726E-05 | global batch size: 512 | lm loss: 1.960412E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.425 | TFLOPs: 
57.52 |
7: iteration 32980/ 44073 | consumed samples: 16885760 | consumed tokens: 34582036480 | elapsed time per iteration (s): 4.15 | learning rate: 4.722E-05 | global batch size: 512 | lm loss: 1.956433E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 |
7: iteration 32990/ 44073 | consumed samples: 16890880 | consumed tokens: 34592522240 | elapsed time per iteration (s): 4.15 | learning rate: 4.717E-05 | global batch size: 512 | lm loss: 1.950343E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.441 | TFLOPs: 57.53 |
7: iteration 33000/ 44073 | consumed samples: 16896000 | consumed tokens: 34603008000 | elapsed time per iteration (s): 4.16 | learning rate: 4.712E-05 | global batch size: 512 | lm loss: 1.949138E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.204 | TFLOPs: 57.42 |
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 33000 | lm loss value: 1.894893E+00 | lm loss PPL: 6.651836E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 33000 to checkpoints_2b2
0: [2022-11-27 01:04:07,846] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step33000 is begin to save!
0: [2022-11-27 01:04:07,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_01-model_00-model_states.pt...
0: [2022-11-27 01:04:08,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_01-model_00-model_states.pt.
0: [2022-11-27 01:04:08,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_03-model_00-model_states.pt...
0: [2022-11-27 01:04:08,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:04:08,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:04:08,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:04:08,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:04:08,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:04:08,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:04:08,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:04:08,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:04:08,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:04:08,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:04:09,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:04:09,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_09-model_00-model_states.pt... 
0: [2022-11-27 01:04:09,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:04:09,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:04:09,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:04:09,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:04:09,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:04:09,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:04:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:04:09,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:04:09,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:04:09,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:04:09,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:04:09,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_15-model_00-model_states.pt... 
0: [2022-11-27 01:04:09,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:04:09,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:04:10,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:04:10,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:04:10,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:04:10,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:04:10,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:04:10,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:04:10,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:04:10,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:04:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:04:10,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_21-model_00-model_states.pt... 
0: [2022-11-27 01:04:10,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:04:10,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:04:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:04:10,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:04:10,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:04:10,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:04:10,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:04:10,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:04:11,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:04:11,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:04:11,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:04:11,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_27-model_00-model_states.pt... 
0: [2022-11-27 01:04:11,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:04:11,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:04:11,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:04:11,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_29-model_00-model_states.pt... 0: [2022-11-27 01:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_29-model_00-model_states.pt. 0: [2022-11-27 01:04:11,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:04:11,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:04:11,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_31-model_00-model_states.pt... 0: [2022-11-27 01:04:11,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_31-model_00-model_states.pt. 0: [2022-11-27 01:04:11,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_32-model_00-model_states.pt... 0: [2022-11-27 01:04:11,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_32-model_00-model_states.pt. 0: [2022-11-27 01:04:11,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_33-model_00-model_states.pt... 
0: [2022-11-27 01:04:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_33-model_00-model_states.pt. 0: [2022-11-27 01:04:12,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_34-model_00-model_states.pt... 0: [2022-11-27 01:04:12,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_34-model_00-model_states.pt. 0: [2022-11-27 01:04:12,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/layer_36-model_00-model_states.pt... 0: [2022-11-27 01:04:12,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/layer_36-model_00-model_states.pt. 0: [2022-11-27 01:04:12,229] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step33000/mp_rank_00_model_states.pt 0: [2022-11-27 01:04:12,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:04:12,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
5: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:04:12,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-27 01:04:12,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: [2022-11-27 01:04:12,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-27 01:04:12,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-27 01:04:12,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-27 01:04:12,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: [2022-11-27 01:04:12,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:04:12,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:04:12,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
5: [2022-11-27 01:04:12,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:04:12,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:04:12,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-27 01:04:12,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:04:12,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
3: [2022-11-27 01:04:12,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:04:12,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-27 01:04:12,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:04:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-27 01:04:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:04:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:04:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-27 01:04:13,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-27 01:04:13,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:04:13,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 01:04:13,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-27 01:04:13,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:04:13,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 01:04:13,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
2: [2022-11-27 01:04:13,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
2: [2022-11-27 01:04:13,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
2: [2022-11-27 01:04:13,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:04:13,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt
2: [2022-11-27 01:04:13,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
6: [2022-11-27 01:04:13,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:04:13,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt
6: [2022-11-27 01:04:13,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
0: [2022-11-27 01:04:13,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-27 01:04:13,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-27 01:04:13,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-27 01:04:13,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
0: successfully saved checkpoint at iteration 33000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 5749.89
7: iteration 33010/ 44073 | consumed samples: 16901120 | consumed tokens: 34613493760 | elapsed time per iteration (s): 4.85 | learning rate: 4.708E-05 | global batch size: 512 | lm loss: 1.947036E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.675 | TFLOPs: 49.25 |
7: iteration 33020/ 44073 | consumed samples: 16906240 | consumed tokens: 34623979520 | elapsed time per iteration (s): 4.17 | learning rate: 4.703E-05 | global batch size: 512 | lm loss: 1.961392E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.919 | TFLOPs: 57.29 |
7: iteration 33030/ 44073 | consumed samples: 16911360 | consumed tokens: 34634465280 | elapsed time per iteration (s): 4.18 | learning rate: 4.698E-05 | global batch size: 512 | lm loss: 1.959560E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.374 | TFLOPs: 57.03 |
7: iteration 33040/ 44073 | consumed samples: 16916480 | consumed tokens: 34644951040 | elapsed time per iteration (s): 4.35 | learning rate: 4.694E-05 | global batch size: 512 | lm loss: 1.953958E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.592 | TFLOPs: 54.80 |
7: iteration 33050/ 44073 | consumed samples: 16921600 | consumed tokens: 34655436800 | elapsed time per iteration (s): 4.15 | learning rate: 4.689E-05 | global batch size: 512 | lm loss: 1.951166E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.474 | TFLOPs: 57.54 |
7: iteration 33060/ 44073 | consumed samples: 16926720 | consumed tokens: 34665922560 | elapsed time per iteration (s): 4.17 | learning rate: 4.685E-05 | global batch size: 512 | lm loss: 1.944424E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.917 | TFLOPs: 57.29 |
7: iteration 33070/ 44073 | consumed samples: 16931840 | consumed tokens: 34676408320 | elapsed time per iteration (s): 4.18 | learning rate: 4.680E-05 | global batch size: 512 | lm loss: 1.943172E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.501 | TFLOPs: 57.09 |
7: iteration 33080/ 44073 | consumed samples: 16936960 | consumed tokens: 34686894080 | elapsed time per iteration (s): 4.18 | learning rate: 4.675E-05 | global batch size: 512 | lm loss: 1.945486E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.458 | TFLOPs: 57.07 |
7: iteration 33090/ 44073 | consumed samples: 16942080 | consumed tokens: 34697379840 | elapsed time per iteration (s): 4.19 | learning rate: 4.671E-05 | global batch size: 512 | lm loss: 1.935137E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.129 | TFLOPs: 56.92 |
7: iteration 33100/ 44073 | consumed samples: 16947200 | consumed tokens: 34707865600 | elapsed time per iteration (s): 4.17 | learning rate: 4.666E-05 | global batch size: 512 | lm loss: 1.945436E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.787 | TFLOPs: 57.22 |
7: iteration 33110/ 44073 | consumed samples: 16952320 | consumed tokens: 34718351360 | elapsed time per iteration (s): 4.14 | learning rate: 4.661E-05 | global batch size: 512 | lm loss: 1.949664E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.796 | TFLOPs: 57.70 |
7: iteration 33120/ 44073 | consumed samples: 16957440 | consumed tokens: 34728837120 | elapsed time per iteration (s): 4.15 | learning rate: 4.657E-05 | global batch size: 512 | lm loss: 1.968038E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.262 | TFLOPs: 57.45 |
7: iteration 33130/ 44073 | consumed samples: 16962560 | consumed tokens: 34739322880 | elapsed time per iteration (s): 4.16 | learning rate: 4.652E-05 | global batch size: 512 | lm loss: 1.948632E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.148 | TFLOPs: 57.39 |
7: iteration 33140/ 44073 | consumed samples: 16967680 | consumed tokens: 34749808640 | elapsed time per iteration (s): 4.14 | learning rate: 4.648E-05 | global batch size: 512 | lm loss: 1.929122E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.560 | TFLOPs: 57.59 |
7: iteration 33150/ 44073 | consumed samples: 16972800 | consumed tokens: 34760294400 | elapsed time per iteration (s): 4.19 | learning rate: 4.643E-05 | global batch size: 512 | lm loss: 1.948341E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.238 | TFLOPs: 56.97 |
7: iteration 33160/ 44073 | consumed samples: 16977920 | consumed tokens: 34770780160 | elapsed time per iteration (s): 4.16 | learning rate: 4.639E-05 | global batch size: 512 | lm loss: 1.933267E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.202 | TFLOPs: 57.42 |
7: iteration 33170/ 44073 | consumed samples: 16983040 | consumed tokens: 34781265920 | elapsed time per iteration (s): 4.18 | learning rate: 4.634E-05 | global batch size: 512 | lm loss: 1.949885E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.634 | TFLOPs: 57.15 |
7: iteration 33180/ 44073 | consumed samples: 16988160 | consumed tokens: 34791751680 | elapsed time per iteration (s): 4.20 | learning rate: 4.629E-05 | global batch size: 512 | lm loss: 1.963554E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.954 | TFLOPs: 56.84 |
7: iteration 33190/ 44073 | consumed samples: 16993280 | consumed tokens: 34802237440 | elapsed time per iteration (s): 4.16 | learning rate: 4.625E-05 | global batch size: 512 | lm loss: 1.964494E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.097 | TFLOPs: 57.37 |
7: iteration 33200/ 44073 | consumed samples: 16998400 | consumed tokens: 34812723200 | elapsed time per iteration (s): 4.14 | learning rate: 4.620E-05 | global batch size: 512 | lm loss: 1.946225E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.710 | TFLOPs: 57.66 |
7: iteration 33210/ 44073 | consumed samples: 17003520 | consumed tokens: 34823208960 | elapsed time per iteration (s): 4.17 | learning rate: 4.616E-05 | global batch size: 512 | lm loss: 1.936043E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.730 | TFLOPs: 57.20 |
7: iteration 33220/ 44073 | consumed samples: 17008640 | consumed tokens: 34833694720 | elapsed time per iteration (s): 4.17 | learning rate: 4.611E-05 | global batch size: 512 | lm loss: 1.969133E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.642 | TFLOPs: 57.16 |
7: iteration 33230/ 44073 | consumed samples: 17013760 | consumed tokens: 34844180480 | elapsed time per iteration (s): 4.17 | learning rate: 4.607E-05 | global batch size: 512 | lm loss: 1.944081E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.846 | TFLOPs: 57.25 |
7: iteration 33240/ 44073 | consumed samples: 17018880 | consumed tokens: 34854666240 | elapsed time per iteration (s): 4.16 | learning rate: 4.602E-05 | global batch size: 512 | lm loss: 1.940789E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.001 | TFLOPs: 57.32 |
7: iteration 33250/ 44073 | consumed samples: 17024000 | consumed tokens: 34865152000 | elapsed time per iteration (s): 4.15 | learning rate: 4.597E-05 | global batch size: 512 | lm loss: 1.955085E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.295 | TFLOPs: 57.46 |
7: iteration 33260/ 44073 | consumed samples: 17029120 | consumed tokens: 34875637760 | elapsed time per iteration (s): 4.18 | learning rate: 4.593E-05 | global batch size: 512 | lm loss: 1.962694E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.629 | TFLOPs: 57.15 |
7: iteration 33270/ 44073 | consumed samples: 17034240 | consumed tokens: 34886123520 | elapsed time per iteration (s): 4.17 | learning rate: 4.588E-05 | global batch size: 512 | lm loss: 1.953497E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.887 | TFLOPs: 57.27 |
7: iteration 33280/ 44073 | consumed samples: 17039360 | consumed tokens: 34896609280 | elapsed time per iteration (s): 4.18 | learning rate: 4.584E-05 | global batch size: 512 | lm loss: 1.969511E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.620 | TFLOPs: 57.15 |
7: iteration 33290/ 44073 | consumed samples: 17044480 | consumed tokens: 34907095040 | elapsed time per iteration (s): 4.17 | learning rate: 4.579E-05 | global batch size: 512 | lm loss: 1.937046E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.880 | TFLOPs: 57.27 |
7: iteration 33300/ 44073 | consumed samples: 17049600 | consumed tokens: 34917580800 | elapsed time per iteration (s): 4.18 | learning rate: 4.575E-05 | global batch size: 512 | lm loss: 1.961308E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.501 | TFLOPs: 57.09 |
7: iteration 33310/ 44073 | consumed samples: 17054720 | consumed tokens: 34928066560 | elapsed time per iteration (s): 4.17 | learning rate: 4.570E-05 | global batch size: 512 | lm loss: 1.952983E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.719 | TFLOPs: 57.19 |
7: iteration 33320/ 44073 | consumed samples: 17059840 | consumed tokens: 34938552320 | elapsed time per iteration (s): 4.16 | learning rate: 4.566E-05 | global batch size: 512 | lm loss: 1.946508E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.164 | TFLOPs: 57.40 |
7: iteration 33330/ 44073 | consumed samples: 17064960 | consumed tokens: 34949038080 | elapsed time per iteration (s): 4.19 | learning rate: 4.561E-05 | global batch size: 512 | lm loss: 1.933826E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.331 | TFLOPs: 57.01 |
7: iteration 33340/ 44073 | consumed samples: 17070080 | consumed tokens: 34959523840 | elapsed time per iteration (s): 4.22 | learning rate: 4.557E-05 | global batch size: 512 | lm loss: 1.959557E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.206 | TFLOPs: 56.49 |
7: iteration 33350/ 44073 | consumed samples: 17075200 | consumed tokens: 34970009600 | elapsed time per iteration (s): 4.15 | learning rate: 4.552E-05 | global batch size: 512 | lm loss: 1.943701E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.413 | TFLOPs: 57.52 |
7: iteration 33360/ 44073 | consumed samples: 17080320 | consumed tokens: 34980495360 | elapsed time per iteration (s): 4.26 | learning rate: 4.547E-05 | global batch size: 512 | lm loss: 1.944692E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.236 | TFLOPs: 56.04 |
7: iteration 33370/ 44073 | consumed samples: 17085440 | consumed tokens: 34990981120 | elapsed time per iteration (s): 4.15 | learning rate: 4.543E-05 | global batch size: 512 | lm loss: 1.957286E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.300 | TFLOPs: 57.46 |
7: iteration 33380/ 44073 | consumed samples: 17090560 | consumed tokens: 35001466880 | elapsed time per iteration (s): 4.16 | learning rate: 4.538E-05 | global batch size: 512 | lm loss: 1.933191E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.000 | TFLOPs: 57.32 |
7: iteration 33390/ 44073 | consumed samples: 17095680 | consumed tokens: 35011952640 | elapsed time per iteration (s): 4.19 | learning rate: 4.534E-05 | global batch size: 512 | lm loss: 1.947640E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.267 | TFLOPs: 56.98 |
7: iteration 33400/ 44073 | consumed samples: 17100800 | consumed tokens: 35022438400 | elapsed time per iteration (s): 4.19 | learning rate: 4.529E-05 | global batch size: 512 | lm loss: 1.926470E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.281 | TFLOPs: 56.99 |
7: iteration 33410/ 44073 | consumed samples: 17105920 | consumed tokens: 35032924160 | elapsed time per iteration (s): 4.20 | learning rate: 4.525E-05 | global batch size: 512 | lm loss: 1.946953E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.896 | TFLOPs: 56.81 |
7: iteration 33420/ 44073 | consumed samples: 17111040 | consumed tokens: 35043409920 | elapsed time per iteration (s): 4.14 | learning rate: 4.520E-05 | global batch size: 512 | lm loss: 1.954333E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.546 | TFLOPs: 57.58 |
7: iteration 33430/ 44073 | consumed samples: 17116160 | consumed tokens: 35053895680 | elapsed time per iteration (s): 4.18 | learning rate: 4.516E-05 | global batch size: 512 | lm loss: 1.936162E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.587 | TFLOPs: 57.13 |
7: iteration 33440/ 44073 | consumed samples: 17121280 | consumed tokens: 35064381440 | elapsed time per iteration (s): 4.15 | learning rate: 4.511E-05 | global batch size: 512 | lm loss: 1.942886E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.372 | TFLOPs: 57.50 |
7: iteration 33450/ 44073 | consumed samples: 17126400 | consumed tokens: 35074867200 | elapsed time per iteration (s): 4.14 | learning rate: 4.507E-05 | global batch size: 512 | lm loss: 1.964360E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.729 | TFLOPs: 57.66 |
7: iteration 33460/ 44073 | consumed samples: 17131520 | consumed tokens: 35085352960 | elapsed time per iteration (s): 4.14 | learning rate: 4.502E-05 | global batch size: 512 | lm loss: 1.957647E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.719 | TFLOPs: 57.66 |
7: iteration 33470/ 44073 | consumed samples: 17136640 | consumed tokens: 35095838720 | elapsed time per iteration (s): 4.14 | learning rate: 4.498E-05 | global batch size: 512 | lm loss: 1.950191E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.602 | TFLOPs: 57.60 |
7: iteration 33480/ 44073 | consumed samples: 17141760 | consumed tokens: 35106324480 | elapsed time per iteration (s): 4.19 | learning rate: 4.494E-05 | global batch size: 512 | lm loss: 1.978103E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.254 | TFLOPs: 56.98 |
7: iteration 33490/ 44073 | consumed samples: 17146880 | consumed tokens: 35116810240 | elapsed time per iteration (s): 4.16 | learning rate: 4.489E-05 | global batch size: 512 | lm loss: 1.946108E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.198 | TFLOPs: 57.42 |
7: iteration 33500/ 44073 | consumed samples: 17152000 | consumed tokens: 35127296000 | elapsed time per iteration (s): 4.15 | learning rate: 4.485E-05 | global batch size: 512 | lm loss: 1.973652E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.266 | TFLOPs: 57.45 |
7: iteration 33510/ 44073 | consumed samples: 17157120 | consumed tokens: 35137781760 | elapsed time per iteration (s): 4.18 | learning rate: 4.480E-05 | global batch size: 512 | lm loss: 1.944226E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.565 | TFLOPs: 57.12 |
7: iteration 33520/ 44073 | consumed samples: 17162240 | consumed tokens: 35148267520 | elapsed time per iteration (s): 4.31 | learning rate: 4.476E-05 | global batch size: 512 | lm loss: 1.955884E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.747 | TFLOPs: 55.34 |
7: iteration 33530/ 44073 | consumed samples: 17167360 | consumed tokens: 35158753280 | elapsed time per iteration (s): 4.13 | learning rate: 4.471E-05 | global batch size: 512 | lm loss: 1.935583E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.830 | TFLOPs: 57.71 |
7: iteration 33540/ 44073 | consumed samples: 17172480 | consumed tokens: 35169239040 | elapsed time per iteration (s): 4.21 | learning rate: 4.467E-05 | global batch size: 512 | lm loss: 1.942463E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.573 | TFLOPs: 56.66 |
7: iteration 33550/ 44073 | consumed samples: 17177600 | consumed tokens: 35179724800 | elapsed time per iteration (s): 4.18 | learning rate: 4.462E-05 | global batch size: 512 | lm loss: 1.964543E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.524 | TFLOPs: 57.10 |
7: iteration 33560/ 44073 | consumed samples: 17182720 | consumed tokens: 35190210560 | elapsed time per iteration (s): 4.17 | learning rate: 4.458E-05 | global batch size: 512 | lm loss: 1.936729E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.855 | TFLOPs: 57.26 |
7: iteration 33570/ 44073 | consumed samples: 17187840 | consumed tokens: 35200696320 | elapsed time per iteration (s): 4.16 | learning rate: 4.453E-05 | global batch size: 512 | lm loss: 1.951041E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.989 | TFLOPs: 57.32 |
7: iteration 33580/ 44073 | consumed samples: 17192960 | consumed tokens: 35211182080 | elapsed time per iteration (s): 4.15 | learning rate: 4.449E-05 | global batch size: 512 | lm loss: 1.933262E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.498 | TFLOPs: 57.56 |
7: iteration 33590/ 44073 | consumed samples: 17198080 | consumed tokens: 35221667840 | elapsed time per iteration (s): 4.15 | learning rate: 4.444E-05 | global batch size: 512 | lm loss: 1.955496E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.500 | TFLOPs: 57.56 |
7: iteration 33600/ 44073 | consumed samples: 17203200 | consumed tokens: 35232153600 | elapsed time per iteration (s): 4.15 | learning rate: 4.440E-05 | global batch size: 512 | lm loss: 1.944547E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.422 | TFLOPs: 57.52 |
7: iteration 33610/ 44073 | consumed samples: 17208320 | consumed tokens: 35242639360 | elapsed time per iteration (s): 4.14 | learning rate: 4.436E-05 | global batch size: 512 | lm loss: 1.951534E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.761 | TFLOPs: 57.68 |
7: iteration 33620/ 44073 | consumed samples: 17213440 | consumed tokens: 35253125120 | elapsed time per iteration (s): 4.14 | learning rate: 4.431E-05 | global batch size: 512 | lm loss: 1.956161E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.809 | TFLOPs: 57.70 |
7: iteration 33630/ 44073 | consumed samples: 17218560 | consumed tokens: 35263610880 | elapsed time per iteration (s): 4.14 | learning rate: 4.427E-05 | global batch size: 512 | lm loss: 1.956354E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.642 | TFLOPs: 57.62 |
7: iteration 33640/ 44073 | consumed samples: 17223680 | consumed tokens: 35274096640 | elapsed time per iteration (s): 4.18 | learning rate: 4.422E-05 | global batch size: 512 | lm loss: 1.936593E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.580 | TFLOPs: 57.13 |
7: iteration 33650/ 44073 | consumed samples: 17228800 | consumed tokens: 35284582400 | elapsed time per iteration (s): 4.13 | learning rate: 4.418E-05 | global batch size: 512 | lm loss: 1.944875E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.966 | TFLOPs: 57.77 |
7: iteration 33660/ 44073 | consumed samples: 17233920 | consumed tokens: 35295068160 | elapsed time per iteration (s): 4.14 | learning rate: 4.413E-05 | global batch size: 512 | lm loss: 1.950381E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.777 | TFLOPs: 57.69 |
7: iteration 33670/ 44073 | consumed samples: 17239040 | consumed tokens: 35305553920 | elapsed time per iteration (s): 4.16 | learning rate: 4.409E-05 | global batch size: 512 | lm loss: 1.948977E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.161 | TFLOPs: 57.40 |
7: iteration 33680/ 44073 | consumed samples: 17244160 | consumed tokens: 35316039680 | elapsed time per iteration (s): 4.14 | learning rate: 4.405E-05 | global batch size: 512 | lm loss: 1.951041E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.648 | TFLOPs: 57.63 |
7: iteration 33690/ 44073 | consumed samples: 17249280 | consumed tokens: 35326525440 | elapsed time per iteration (s): 4.16 | learning rate: 4.400E-05 | global batch size: 512 | lm loss: 1.943764E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.111 | TFLOPs: 57.38 |
7: iteration 33700/ 44073 | consumed samples: 17254400 | consumed tokens: 35337011200 | elapsed time per iteration (s): 4.15 | learning rate: 4.396E-05 | global batch size: 512 | lm loss: 1.929085E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.495 | TFLOPs: 57.55 |
7: iteration 33710/ 44073 | consumed samples: 17259520 | consumed tokens: 35347496960 | elapsed time per iteration (s): 4.14 | learning rate: 4.391E-05 | global batch size: 512 | lm loss: 1.961036E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.651 | TFLOPs: 57.63 |
7: iteration 33720/ 44073 | consumed samples: 17264640 | consumed tokens: 35357982720 | elapsed time per iteration (s): 4.14 | learning rate: 4.387E-05 | global batch size: 512 | lm loss: 1.929548E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.787 | TFLOPs:
57.69 | 7: iteration 33730/ 44073 | consumed samples: 17269760 | consumed tokens: 35368468480 | elapsed time per iteration (s): 4.15 | learning rate: 4.383E-05 | global batch size: 512 | lm loss: 1.935448E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.371 | TFLOPs: 57.50 | 7: iteration 33740/ 44073 | consumed samples: 17274880 | consumed tokens: 35378954240 | elapsed time per iteration (s): 4.16 | learning rate: 4.378E-05 | global batch size: 512 | lm loss: 1.941663E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.934 | TFLOPs: 57.29 | 7: iteration 33750/ 44073 | consumed samples: 17280000 | consumed tokens: 35389440000 | elapsed time per iteration (s): 4.18 | learning rate: 4.374E-05 | global batch size: 512 | lm loss: 1.960415E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.558 | TFLOPs: 57.12 | 7: iteration 33760/ 44073 | consumed samples: 17285120 | consumed tokens: 35399925760 | elapsed time per iteration (s): 4.16 | learning rate: 4.370E-05 | global batch size: 512 | lm loss: 1.944173E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.153 | TFLOPs: 57.40 | 7: iteration 33770/ 44073 | consumed samples: 17290240 | consumed tokens: 35410411520 | elapsed time per iteration (s): 4.15 | learning rate: 4.365E-05 | global batch size: 512 | lm loss: 1.934691E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.324 | TFLOPs: 57.48 | 7: iteration 33780/ 44073 | consumed samples: 17295360 | consumed tokens: 35420897280 | elapsed time per iteration (s): 4.20 | learning rate: 4.361E-05 | global batch size: 512 | lm loss: 1.944016E+00 | grad norm: 0.123 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.986 | TFLOPs: 56.85 | 7: iteration 33790/ 44073 | consumed samples: 17300480 | consumed tokens: 35431383040 | elapsed time per iteration (s): 4.18 | learning rate: 4.356E-05 | global batch size: 512 | lm loss: 1.947141E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.466 | TFLOPs: 57.08 | 7: iteration 33800/ 44073 | consumed samples: 17305600 | consumed tokens: 35441868800 | elapsed time per iteration (s): 4.19 | learning rate: 4.352E-05 | global batch size: 512 | lm loss: 1.938698E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.117 | TFLOPs: 56.91 | 7: iteration 33810/ 44073 | consumed samples: 17310720 | consumed tokens: 35452354560 | elapsed time per iteration (s): 4.17 | learning rate: 4.348E-05 | global batch size: 512 | lm loss: 1.945438E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.878 | TFLOPs: 57.27 | 7: iteration 33820/ 44073 | consumed samples: 17315840 | consumed tokens: 35462840320 | elapsed time per iteration (s): 4.18 | learning rate: 4.343E-05 | global batch size: 512 | lm loss: 1.928028E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.570 | TFLOPs: 57.12 | 7: iteration 33830/ 44073 | consumed samples: 17320960 | consumed tokens: 35473326080 | elapsed time per iteration (s): 4.17 | learning rate: 4.339E-05 | global batch size: 512 | lm loss: 1.960465E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.638 | TFLOPs: 57.16 | 7: iteration 33840/ 44073 | consumed samples: 17326080 | consumed tokens: 35483811840 | elapsed time per iteration (s): 4.18 | learning rate: 4.335E-05 
| global batch size: 512 | lm loss: 1.937069E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.471 | TFLOPs: 57.08 | 7: iteration 33850/ 44073 | consumed samples: 17331200 | consumed tokens: 35494297600 | elapsed time per iteration (s): 4.20 | learning rate: 4.330E-05 | global batch size: 512 | lm loss: 1.950779E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.893 | TFLOPs: 56.81 | 7: iteration 33860/ 44073 | consumed samples: 17336320 | consumed tokens: 35504783360 | elapsed time per iteration (s): 4.19 | learning rate: 4.326E-05 | global batch size: 512 | lm loss: 1.952878E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.230 | TFLOPs: 56.97 | 7: iteration 33870/ 44073 | consumed samples: 17341440 | consumed tokens: 35515269120 | elapsed time per iteration (s): 4.31 | learning rate: 4.322E-05 | global batch size: 512 | lm loss: 1.957088E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.854 | TFLOPs: 55.39 | 7: iteration 33880/ 44073 | consumed samples: 17346560 | consumed tokens: 35525754880 | elapsed time per iteration (s): 4.16 | learning rate: 4.317E-05 | global batch size: 512 | lm loss: 1.952705E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.171 | TFLOPs: 57.40 | 7: iteration 33890/ 44073 | consumed samples: 17351680 | consumed tokens: 35536240640 | elapsed time per iteration (s): 4.34 | learning rate: 4.313E-05 | global batch size: 512 | lm loss: 1.935417E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.102 | TFLOPs: 55.04 | 7: iteration 33900/ 44073 | consumed samples: 17356800 | 
consumed tokens: 35546726400 | elapsed time per iteration (s): 4.15 | learning rate: 4.308E-05 | global batch size: 512 | lm loss: 1.938102E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.383 | TFLOPs: 57.50 | 7: iteration 33910/ 44073 | consumed samples: 17361920 | consumed tokens: 35557212160 | elapsed time per iteration (s): 4.17 | learning rate: 4.304E-05 | global batch size: 512 | lm loss: 1.957607E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.636 | TFLOPs: 57.15 | 7: iteration 33920/ 44073 | consumed samples: 17367040 | consumed tokens: 35567697920 | elapsed time per iteration (s): 4.16 | learning rate: 4.300E-05 | global batch size: 512 | lm loss: 1.957978E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.990 | TFLOPs: 57.32 | 7: iteration 33930/ 44073 | consumed samples: 17372160 | consumed tokens: 35578183680 | elapsed time per iteration (s): 4.20 | learning rate: 4.296E-05 | global batch size: 512 | lm loss: 1.938037E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.889 | TFLOPs: 56.81 | 7: iteration 33940/ 44073 | consumed samples: 17377280 | consumed tokens: 35588669440 | elapsed time per iteration (s): 4.21 | learning rate: 4.291E-05 | global batch size: 512 | lm loss: 1.957692E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.565 | TFLOPs: 56.66 | 7: iteration 33950/ 44073 | consumed samples: 17382400 | consumed tokens: 35599155200 | elapsed time per iteration (s): 4.15 | learning rate: 4.287E-05 | global batch size: 512 | lm loss: 1.949558E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 123.337 | TFLOPs: 57.48 | 7: iteration 33960/ 44073 | consumed samples: 17387520 | consumed tokens: 35609640960 | elapsed time per iteration (s): 4.35 | learning rate: 4.283E-05 | global batch size: 512 | lm loss: 1.935379E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.684 | TFLOPs: 54.85 | 7: iteration 33970/ 44073 | consumed samples: 17392640 | consumed tokens: 35620126720 | elapsed time per iteration (s): 4.37 | learning rate: 4.278E-05 | global batch size: 512 | lm loss: 1.947331E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.184 | TFLOPs: 54.61 | 7: iteration 33980/ 44073 | consumed samples: 17397760 | consumed tokens: 35630612480 | elapsed time per iteration (s): 4.19 | learning rate: 4.274E-05 | global batch size: 512 | lm loss: 1.925986E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.233 | TFLOPs: 56.97 | 7: iteration 33990/ 44073 | consumed samples: 17402880 | consumed tokens: 35641098240 | elapsed time per iteration (s): 4.16 | learning rate: 4.270E-05 | global batch size: 512 | lm loss: 1.938312E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.061 | TFLOPs: 57.35 | 0: [2022-11-27 02:13:50,291] [INFO] [logging.py:68:log_dist] [Rank 0] step=34000, skipped=0, lr=[4.2653340074087434e-05, 4.2653340074087434e-05, 4.2653340074087434e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 7: iteration 34000/ 44073 | consumed samples: 17408000 | consumed tokens: 35651584000 | elapsed time per iteration (s): 4.17 | learning rate: 4.265E-05 | global batch size: 512 | lm loss: 1.943437E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.782 | TFLOPs: 57.22 | 
0: steps: 34000 loss: 1.9576 iter time (s): 4.194 samples/sec: 122.075
7: -------------------------------------------------------------------------------------------
7: valid loss at iteration 34000 | lm loss value: 1.855716E+00 | lm loss PPL: 6.396278E+00 |
7: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 34000 to checkpoints_2b2
0: [2022-11-27 02:13:51,633] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step34000 is begin to save!
0: [2022-11-27 02:13:51,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_01-model_00-model_states.pt...
0: [2022-11-27 02:13:51,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_01-model_00-model_states.pt.
0: [2022-11-27 02:13:51,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_03-model_00-model_states.pt...
0: [2022-11-27 02:13:52,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_03-model_00-model_states.pt.
0: [2022-11-27 02:13:52,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_04-model_00-model_states.pt...
0: [2022-11-27 02:13:52,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_04-model_00-model_states.pt.
0: [2022-11-27 02:13:52,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_05-model_00-model_states.pt...
0: [2022-11-27 02:13:52,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_05-model_00-model_states.pt.
0: [2022-11-27 02:13:52,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_06-model_00-model_states.pt...
0: [2022-11-27 02:13:52,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:13:52,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:13:52,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:13:52,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:13:52,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:13:52,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:13:52,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:13:52,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:13:53,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:13:53,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:13:53,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:13:53,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 02:13:53,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:13:53,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:13:53,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:13:53,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:13:53,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:13:53,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:13:53,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:13:53,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:13:53,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:13:53,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:13:54,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:13:54,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 02:13:54,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:13:54,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:13:54,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:13:54,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:13:54,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:13:54,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:13:54,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:13:54,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:13:54,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:13:54,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:13:54,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:13:54,926] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 02:13:55,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:13:55,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:13:55,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:13:55,202] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:13:55,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:13:55,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:13:55,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:13:55,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:13:55,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:13:55,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_29-model_00-model_states.pt... 0: [2022-11-27 02:13:55,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_29-model_00-model_states.pt. 0: [2022-11-27 02:13:55,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 02:13:55,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:13:55,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_31-model_00-model_states.pt... 0: [2022-11-27 02:13:56,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_31-model_00-model_states.pt. 0: [2022-11-27 02:13:56,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_32-model_00-model_states.pt... 0: [2022-11-27 02:13:56,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_32-model_00-model_states.pt. 0: [2022-11-27 02:13:56,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_33-model_00-model_states.pt... 0: [2022-11-27 02:13:56,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_33-model_00-model_states.pt. 0: [2022-11-27 02:13:56,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_34-model_00-model_states.pt... 0: [2022-11-27 02:13:56,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_34-model_00-model_states.pt. 0: [2022-11-27 02:13:56,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/layer_36-model_00-model_states.pt... 0: [2022-11-27 02:13:56,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/layer_36-model_00-model_states.pt. 
0: [2022-11-27 02:13:56,454] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step34000/mp_rank_00_model_states.pt 0: [2022-11-27 02:13:56,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:13:56,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:13:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:57,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:57,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:57,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:57,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:13:57,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:57,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:13:57,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-27 02:13:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-27 02:13:57,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:57,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:57,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:13:57,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:13:57,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-27 02:13:57,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:57,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:57,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:13:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-27 02:13:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:13:57,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-27 02:13:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:57,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:57,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
0: [2022-11-27 02:13:57,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:57,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:13:57,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:57,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-27 02:13:57,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:13:57,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:57,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:57,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-27 02:13:57,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:13:57,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:57,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-27 02:13:57,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: successfully saved checkpoint at iteration 34000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6021.67 7: iteration 34010/ 44073 | consumed samples: 17413120 | consumed tokens: 35662069760 | elapsed time per iteration (s): 4.89 | learning rate: 4.261E-05 | global batch size: 512 | lm loss: 1.943145E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.719 | TFLOPs: 48.80 | 7: iteration 34020/ 44073 | consumed samples: 17418240 | consumed tokens: 35672555520 | elapsed time per iteration (s): 4.14 | learning rate: 4.257E-05 | global batch size: 512 | lm loss: 1.939317E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.720 | TFLOPs: 57.66 | 7: iteration 34030/ 44073 | consumed samples: 17423360 | consumed tokens: 35683041280 | elapsed time per iteration (s): 4.13 | learning rate: 4.252E-05 | global batch size: 512 | lm loss: 1.947211E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.973 | TFLOPs: 57.78 | 7: iteration 34040/ 44073 | consumed samples: 17428480 | consumed tokens: 35693527040 | elapsed time per iteration (s): 4.35 | learning rate: 4.248E-05 | 
global batch size: 512 | lm loss: 1.938752E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.587 | TFLOPs: 54.80 | 7: iteration 34050/ 44073 | consumed samples: 17433600 | consumed tokens: 35704012800 | elapsed time per iteration (s): 4.15 | learning rate: 4.244E-05 | global batch size: 512 | lm loss: 1.970114E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.291 | TFLOPs: 57.46 | 7: iteration 34060/ 44073 | consumed samples: 17438720 | consumed tokens: 35714498560 | elapsed time per iteration (s): 4.15 | learning rate: 4.240E-05 | global batch size: 512 | lm loss: 1.942350E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.352 | TFLOPs: 57.49 | 7: iteration 34070/ 44073 | consumed samples: 17443840 | consumed tokens: 35724984320 | elapsed time per iteration (s): 4.17 | learning rate: 4.235E-05 | global batch size: 512 | lm loss: 1.951802E+00 | grad norm: 0.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.807 | TFLOPs: 57.23 | 7: iteration 34080/ 44073 | consumed samples: 17448960 | consumed tokens: 35735470080 | elapsed time per iteration (s): 4.17 | learning rate: 4.231E-05 | global batch size: 512 | lm loss: 1.948751E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.789 | TFLOPs: 57.23 | 7: iteration 34090/ 44073 | consumed samples: 17454080 | consumed tokens: 35745955840 | elapsed time per iteration (s): 4.16 | learning rate: 4.227E-05 | global batch size: 512 | lm loss: 1.957523E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.142 | TFLOPs: 57.39 | 7: iteration 34100/ 44073 | consumed samples: 17459200 | 
consumed tokens: 35756441600 | elapsed time per iteration (s): 4.18 | learning rate: 4.223E-05 | global batch size: 512 | lm loss: 1.961341E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.584 | TFLOPs: 57.13 | 7: iteration 34110/ 44073 | consumed samples: 17464320 | consumed tokens: 35766927360 | elapsed time per iteration (s): 4.14 | learning rate: 4.218E-05 | global batch size: 512 | lm loss: 1.939221E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.668 | TFLOPs: 57.64 | 7: iteration 34120/ 44073 | consumed samples: 17469440 | consumed tokens: 35777413120 | elapsed time per iteration (s): 4.16 | learning rate: 4.214E-05 | global batch size: 512 | lm loss: 1.968823E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.089 | TFLOPs: 57.37 | 7: iteration 34130/ 44073 | consumed samples: 17474560 | consumed tokens: 35787898880 | elapsed time per iteration (s): 4.14 | learning rate: 4.210E-05 | global batch size: 512 | lm loss: 1.950833E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.633 | TFLOPs: 57.62 | 7: iteration 34140/ 44073 | consumed samples: 17479680 | consumed tokens: 35798384640 | elapsed time per iteration (s): 4.17 | learning rate: 4.205E-05 | global batch size: 512 | lm loss: 1.935128E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.752 | TFLOPs: 57.21 | 7: iteration 34150/ 44073 | consumed samples: 17484800 | consumed tokens: 35808870400 | elapsed time per iteration (s): 5.88 | learning rate: 4.201E-05 | global batch size: 512 | lm loss: 1.949990E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 87.141 | TFLOPs: 40.61 | 7: iteration 34160/ 44073 | consumed samples: 17489920 | consumed tokens: 35819356160 | elapsed time per iteration (s): 4.18 | learning rate: 4.197E-05 | global batch size: 512 | lm loss: 1.948827E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.573 | TFLOPs: 57.13 | 7: iteration 34170/ 44073 | consumed samples: 17495040 | consumed tokens: 35829841920 | elapsed time per iteration (s): 4.20 | learning rate: 4.193E-05 | global batch size: 512 | lm loss: 1.937571E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.891 | TFLOPs: 56.81 | 7: iteration 34180/ 44073 | consumed samples: 17500160 | consumed tokens: 35840327680 | elapsed time per iteration (s): 4.15 | learning rate: 4.189E-05 | global batch size: 512 | lm loss: 1.941108E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.270 | TFLOPs: 57.45 | 7: iteration 34190/ 44073 | consumed samples: 17505280 | consumed tokens: 35850813440 | elapsed time per iteration (s): 4.14 | learning rate: 4.184E-05 | global batch size: 512 | lm loss: 1.951184E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.669 | TFLOPs: 57.64 | 7: iteration 34200/ 44073 | consumed samples: 17510400 | consumed tokens: 35861299200 | elapsed time per iteration (s): 4.14 | learning rate: 4.180E-05 | global batch size: 512 | lm loss: 1.973294E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.580 | TFLOPs: 57.59 | 7: iteration 34210/ 44073 | consumed samples: 17515520 | consumed tokens: 35871784960 | elapsed time per iteration (s): 4.16 | learning rate: 4.176E-05 | global batch size: 512 | lm loss: 1.935549E+00 | grad norm: 
0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.216 | TFLOPs: 57.42 | 7: iteration 34220/ 44073 | consumed samples: 17520640 | consumed tokens: 35882270720 | elapsed time per iteration (s): 4.13 | learning rate: 4.172E-05 | global batch size: 512 | lm loss: 1.965183E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.909 | TFLOPs: 57.75 | 7: iteration 34230/ 44073 | consumed samples: 17525760 | consumed tokens: 35892756480 | elapsed time per iteration (s): 4.18 | learning rate: 4.167E-05 | global batch size: 512 | lm loss: 1.913116E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.599 | TFLOPs: 57.14 | 7: iteration 34240/ 44073 | consumed samples: 17530880 | consumed tokens: 35903242240 | elapsed time per iteration (s): 4.18 | learning rate: 4.163E-05 | global batch size: 512 | lm loss: 1.947068E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.474 | TFLOPs: 57.08 | 7: iteration 34250/ 44073 | consumed samples: 17536000 | consumed tokens: 35913728000 | elapsed time per iteration (s): 4.19 | learning rate: 4.159E-05 | global batch size: 512 | lm loss: 1.949891E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.152 | TFLOPs: 56.93 | 7: iteration 34260/ 44073 | consumed samples: 17541120 | consumed tokens: 35924213760 | elapsed time per iteration (s): 4.14 | learning rate: 4.155E-05 | global batch size: 512 | lm loss: 1.944949E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.567 | TFLOPs: 57.59 | 7: iteration 34270/ 44073 | consumed samples: 17546240 | consumed tokens: 35934699520 | elapsed time per iteration (s): 
4.14 | learning rate: 4.151E-05 | global batch size: 512 | lm loss: 1.948257E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.645 | TFLOPs: 57.62 | 7: iteration 34280/ 44073 | consumed samples: 17551360 | consumed tokens: 35945185280 | elapsed time per iteration (s): 4.18 | learning rate: 4.146E-05 | global batch size: 512 | lm loss: 1.946840E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.555 | TFLOPs: 57.12 | 7: iteration 34290/ 44073 | consumed samples: 17556480 | consumed tokens: 35955671040 | elapsed time per iteration (s): 4.18 | learning rate: 4.142E-05 | global batch size: 512 | lm loss: 1.956171E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.515 | TFLOPs: 57.10 | 7: iteration 34300/ 44073 | consumed samples: 17561600 | consumed tokens: 35966156800 | elapsed time per iteration (s): 4.19 | learning rate: 4.138E-05 | global batch size: 512 | lm loss: 1.949474E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.100 | TFLOPs: 56.90 | 7: iteration 34310/ 44073 | consumed samples: 17566720 | consumed tokens: 35976642560 | elapsed time per iteration (s): 4.18 | learning rate: 4.134E-05 | global batch size: 512 | lm loss: 1.969539E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.472 | TFLOPs: 57.08 | 7: iteration 34320/ 44073 | consumed samples: 17571840 | consumed tokens: 35987128320 | elapsed time per iteration (s): 4.16 | learning rate: 4.130E-05 | global batch size: 512 | lm loss: 1.969713E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.181 | TFLOPs: 57.41 | 7: iteration 34330/ 44073 
| consumed samples: 17576960 | consumed tokens: 35997614080 | elapsed time per iteration (s): 4.21 | learning rate: 4.125E-05 | global batch size: 512 | lm loss: 1.940752E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.616 | TFLOPs: 56.68 | 7: iteration 34340/ 44073 | consumed samples: 17582080 | consumed tokens: 36008099840 | elapsed time per iteration (s): 4.15 | learning rate: 4.121E-05 | global batch size: 512 | lm loss: 1.957577E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.430 | TFLOPs: 57.52 | 7: iteration 34350/ 44073 | consumed samples: 17587200 | consumed tokens: 36018585600 | elapsed time per iteration (s): 4.15 | learning rate: 4.117E-05 | global batch size: 512 | lm loss: 1.942555E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.303 | TFLOPs: 57.47 | 7: iteration 34360/ 44073 | consumed samples: 17592320 | consumed tokens: 36029071360 | elapsed time per iteration (s): 4.17 | learning rate: 4.113E-05 | global batch size: 512 | lm loss: 1.940533E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.913 | TFLOPs: 57.28 | 7: iteration 34370/ 44073 | consumed samples: 17597440 | consumed tokens: 36039557120 | elapsed time per iteration (s): 4.16 | learning rate: 4.109E-05 | global batch size: 512 | lm loss: 1.948876E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.157 | TFLOPs: 57.40 | 7: iteration 34380/ 44073 | consumed samples: 17602560 | consumed tokens: 36050042880 | elapsed time per iteration (s): 4.16 | learning rate: 4.105E-05 | global batch size: 512 | lm loss: 1.946687E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 123.207 | TFLOPs: 57.42 | 7: iteration 34390/ 44073 | consumed samples: 17607680 | consumed tokens: 36060528640 | elapsed time per iteration (s): 4.14 | learning rate: 4.100E-05 | global batch size: 512 | lm loss: 1.933243E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.700 | TFLOPs: 57.65 | 7: iteration 34400/ 44073 | consumed samples: 17612800 | consumed tokens: 36071014400 | elapsed time per iteration (s): 4.13 | learning rate: 4.096E-05 | global batch size: 512 | lm loss: 1.938750E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.828 | TFLOPs: 57.71 | 7: iteration 34410/ 44073 | consumed samples: 17617920 | consumed tokens: 36081500160 | elapsed time per iteration (s): 4.15 | learning rate: 4.092E-05 | global batch size: 512 | lm loss: 1.941512E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.267 | TFLOPs: 57.45 | 7: iteration 34420/ 44073 | consumed samples: 17623040 | consumed tokens: 36091985920 | elapsed time per iteration (s): 4.16 | learning rate: 4.088E-05 | global batch size: 512 | lm loss: 1.935994E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.153 | TFLOPs: 57.40 | 7: iteration 34430/ 44073 | consumed samples: 17628160 | consumed tokens: 36102471680 | elapsed time per iteration (s): 4.15 | learning rate: 4.084E-05 | global batch size: 512 | lm loss: 1.958211E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.305 | TFLOPs: 57.47 | 7: iteration 34440/ 44073 | consumed samples: 17633280 | consumed tokens: 36112957440 | elapsed time per iteration (s): 4.17 | learning rate: 4.080E-05 | global batch size: 512 | lm 
loss: 1.951025E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.647 | TFLOPs: 57.16 | 7: iteration 34450/ 44073 | consumed samples: 17638400 | consumed tokens: 36123443200 | elapsed time per iteration (s): 4.32 | learning rate: 4.075E-05 | global batch size: 512 | lm loss: 1.935782E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.424 | TFLOPs: 55.19 | 7: iteration 34460/ 44073 | consumed samples: 17643520 | consumed tokens: 36133928960 | elapsed time per iteration (s): 4.13 | learning rate: 4.071E-05 | global batch size: 512 | lm loss: 1.936244E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.931 | TFLOPs: 57.76 | 7: iteration 34470/ 44073 | consumed samples: 17648640 | consumed tokens: 36144414720 | elapsed time per iteration (s): 4.14 | learning rate: 4.067E-05 | global batch size: 512 | lm loss: 1.949233E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.624 | TFLOPs: 57.61 | 7: iteration 34480/ 44073 | consumed samples: 17653760 | consumed tokens: 36154900480 | elapsed time per iteration (s): 4.16 | learning rate: 4.063E-05 | global batch size: 512 | lm loss: 1.948226E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.029 | TFLOPs: 57.34 | 7: iteration 34490/ 44073 | consumed samples: 17658880 | consumed tokens: 36165386240 | elapsed time per iteration (s): 4.14 | learning rate: 4.059E-05 | global batch size: 512 | lm loss: 1.939595E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.678 | TFLOPs: 57.64 | 7: iteration 34500/ 44073 | consumed samples: 17664000 | consumed tokens: 36175872000 | 
elapsed time per iteration (s): 4.15 | learning rate: 4.055E-05 | global batch size: 512 | lm loss: 1.926483E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.379 | TFLOPs: 57.50 | 7: iteration 34510/ 44073 | consumed samples: 17669120 | consumed tokens: 36186357760 | elapsed time per iteration (s): 4.17 | learning rate: 4.051E-05 | global batch size: 512 | lm loss: 1.950724E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.811 | TFLOPs: 57.24 | 7: iteration 34520/ 44073 | consumed samples: 17674240 | consumed tokens: 36196843520 | elapsed time per iteration (s): 4.16 | learning rate: 4.047E-05 | global batch size: 512 | lm loss: 1.949293E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.021 | TFLOPs: 57.33 | 7: iteration 34530/ 44073 | consumed samples: 17679360 | consumed tokens: 36207329280 | elapsed time per iteration (s): 4.18 | learning rate: 4.042E-05 | global batch size: 512 | lm loss: 1.927646E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.615 | TFLOPs: 57.14 | 7: iteration 34540/ 44073 | consumed samples: 17684480 | consumed tokens: 36217815040 | elapsed time per iteration (s): 4.17 | learning rate: 4.038E-05 | global batch size: 512 | lm loss: 1.939985E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.760 | TFLOPs: 57.21 | 7: iteration 34550/ 44073 | consumed samples: 17689600 | consumed tokens: 36228300800 | elapsed time per iteration (s): 4.16 | learning rate: 4.034E-05 | global batch size: 512 | lm loss: 1.929629E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.102 | TFLOPs: 
57.37 | 7: iteration 34560/ 44073 | consumed samples: 17694720 | consumed tokens: 36238786560 | elapsed time per iteration (s): 4.19 | learning rate: 4.030E-05 | global batch size: 512 | lm loss: 1.954264E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.050 | TFLOPs: 56.88 | 7: iteration 34570/ 44073 | consumed samples: 17699840 | consumed tokens: 36249272320 | elapsed time per iteration (s): 4.14 | learning rate: 4.026E-05 | global batch size: 512 | lm loss: 1.930935E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.692 | TFLOPs: 57.65 | 7: iteration 34580/ 44073 | consumed samples: 17704960 | consumed tokens: 36259758080 | elapsed time per iteration (s): 4.15 | learning rate: 4.022E-05 | global batch size: 512 | lm loss: 1.945110E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.233 | TFLOPs: 57.43 | 7: iteration 34590/ 44073 | consumed samples: 17710080 | consumed tokens: 36270243840 | elapsed time per iteration (s): 4.18 | learning rate: 4.018E-05 | global batch size: 512 | lm loss: 1.948904E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.455 | TFLOPs: 57.07 | 7: iteration 34600/ 44073 | consumed samples: 17715200 | consumed tokens: 36280729600 | elapsed time per iteration (s): 4.15 | learning rate: 4.014E-05 | global batch size: 512 | lm loss: 1.943247E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.386 | TFLOPs: 57.50 | 7: iteration 34610/ 44073 | consumed samples: 17720320 | consumed tokens: 36291215360 | elapsed time per iteration (s): 4.15 | learning rate: 4.010E-05 | global batch size: 512 | lm loss: 1.946000E+00 | grad norm: 0.122 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.371 | TFLOPs: 57.50 | 7: iteration 34620/ 44073 | consumed samples: 17725440 | consumed tokens: 36301701120 | elapsed time per iteration (s): 4.19 | learning rate: 4.006E-05 | global batch size: 512 | lm loss: 1.927995E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.173 | TFLOPs: 56.94 | 7: iteration 34630/ 44073 | consumed samples: 17730560 | consumed tokens: 36312186880 | elapsed time per iteration (s): 4.18 | learning rate: 4.002E-05 | global batch size: 512 | lm loss: 1.944355E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.546 | TFLOPs: 57.11 | 7: iteration 34640/ 44073 | consumed samples: 17735680 | consumed tokens: 36322672640 | elapsed time per iteration (s): 4.18 | learning rate: 3.997E-05 | global batch size: 512 | lm loss: 1.916523E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.443 | TFLOPs: 57.06 | 7: iteration 34650/ 44073 | consumed samples: 17740800 | consumed tokens: 36333158400 | elapsed time per iteration (s): 4.14 | learning rate: 3.993E-05 | global batch size: 512 | lm loss: 1.953534E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.795 | TFLOPs: 57.69 | 7: iteration 34660/ 44073 | consumed samples: 17745920 | consumed tokens: 36343644160 | elapsed time per iteration (s): 4.14 | learning rate: 3.989E-05 | global batch size: 512 | lm loss: 1.927965E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.565 | TFLOPs: 57.59 | 7: iteration 34670/ 44073 | consumed samples: 17751040 | consumed tokens: 36354129920 | elapsed time per iteration (s): 4.14 | learning rate: 3.985E-05 
| global batch size: 512 | lm loss: 1.950919E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.647 | TFLOPs: 57.63 | 7: iteration 34680/ 44073 | consumed samples: 17756160 | consumed tokens: 36364615680 | elapsed time per iteration (s): 4.20 | learning rate: 3.981E-05 | global batch size: 512 | lm loss: 1.946251E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.885 | TFLOPs: 56.80 | 7: iteration 34690/ 44073 | consumed samples: 17761280 | consumed tokens: 36375101440 | elapsed time per iteration (s): 4.17 | learning rate: 3.977E-05 | global batch size: 512 | lm loss: 1.919826E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.776 | TFLOPs: 57.22 | 7: iteration 34700/ 44073 | consumed samples: 17766400 | consumed tokens: 36385587200 | elapsed time per iteration (s): 4.24 | learning rate: 3.973E-05 | global batch size: 512 | lm loss: 1.948450E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.627 | TFLOPs: 56.22 | 7: iteration 34710/ 44073 | consumed samples: 17771520 | consumed tokens: 36396072960 | elapsed time per iteration (s): 4.17 | learning rate: 3.969E-05 | global batch size: 512 | lm loss: 1.928937E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.737 | TFLOPs: 57.20 | 7: iteration 34720/ 44073 | consumed samples: 17776640 | consumed tokens: 36406558720 | elapsed time per iteration (s): 4.18 | learning rate: 3.965E-05 | global batch size: 512 | lm loss: 1.944592E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.523 | TFLOPs: 57.10 | 7: iteration 34730/ 44073 | consumed samples: 17781760 | 
consumed tokens: 36417044480 | elapsed time per iteration (s): 4.16 | learning rate: 3.961E-05 | global batch size: 512 | lm loss: 1.949039E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.020 | TFLOPs: 57.33 | 7: iteration 34740/ 44073 | consumed samples: 17786880 | consumed tokens: 36427530240 | elapsed time per iteration (s): 4.17 | learning rate: 3.957E-05 | global batch size: 512 | lm loss: 1.939663E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.908 | TFLOPs: 57.28 | 7: iteration 34750/ 44073 | consumed samples: 17792000 | consumed tokens: 36438016000 | elapsed time per iteration (s): 4.22 | learning rate: 3.953E-05 | global batch size: 512 | lm loss: 1.943721E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.279 | TFLOPs: 56.52 | 7: iteration 34760/ 44073 | consumed samples: 17797120 | consumed tokens: 36448501760 | elapsed time per iteration (s): 4.20 | learning rate: 3.949E-05 | global batch size: 512 | lm loss: 1.954003E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.780 | TFLOPs: 56.76 | 7: iteration 34770/ 44073 | consumed samples: 17802240 | consumed tokens: 36458987520 | elapsed time per iteration (s): 4.15 | learning rate: 3.945E-05 | global batch size: 512 | lm loss: 1.930110E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.303 | TFLOPs: 57.47 | 7: iteration 34780/ 44073 | consumed samples: 17807360 | consumed tokens: 36469473280 | elapsed time per iteration (s): 4.20 | learning rate: 3.941E-05 | global batch size: 512 | lm loss: 1.934199E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 121.966 | TFLOPs: 56.84 | 7: iteration 34790/ 44073 | consumed samples: 17812480 | consumed tokens: 36479959040 | elapsed time per iteration (s): 4.18 | learning rate: 3.937E-05 | global batch size: 512 | lm loss: 1.951490E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.497 | TFLOPs: 57.09 | 7: iteration 34800/ 44073 | consumed samples: 17817600 | consumed tokens: 36490444800 | elapsed time per iteration (s): 4.35 | learning rate: 3.933E-05 | global batch size: 512 | lm loss: 1.966206E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.699 | TFLOPs: 54.85 | 7: iteration 34810/ 44073 | consumed samples: 17822720 | consumed tokens: 36500930560 | elapsed time per iteration (s): 4.17 | learning rate: 3.929E-05 | global batch size: 512 | lm loss: 1.935490E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.844 | TFLOPs: 57.25 | 7: iteration 34820/ 44073 | consumed samples: 17827840 | consumed tokens: 36511416320 | elapsed time per iteration (s): 4.16 | learning rate: 3.925E-05 | global batch size: 512 | lm loss: 1.935634E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.992 | TFLOPs: 57.32 | 7: iteration 34830/ 44073 | consumed samples: 17832960 | consumed tokens: 36521902080 | elapsed time per iteration (s): 4.16 | learning rate: 3.921E-05 | global batch size: 512 | lm loss: 1.932685E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.136 | TFLOPs: 57.39 | 7: iteration 34840/ 44073 | consumed samples: 17838080 | consumed tokens: 36532387840 | elapsed time per iteration (s): 4.14 | learning rate: 3.917E-05 | global batch size: 512 | lm loss: 1.938606E+00 | grad norm: 
0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.789 | TFLOPs: 57.69 | 7: iteration 34850/ 44073 | consumed samples: 17843200 | consumed tokens: 36542873600 | elapsed time per iteration (s): 4.16 | learning rate: 3.913E-05 | global batch size: 512 | lm loss: 1.916129E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.114 | TFLOPs: 57.38 | 7: iteration 34860/ 44073 | consumed samples: 17848320 | consumed tokens: 36553359360 | elapsed time per iteration (s): 4.18 | learning rate: 3.909E-05 | global batch size: 512 | lm loss: 1.964978E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.363 | TFLOPs: 57.03 | 7: iteration 34870/ 44073 | consumed samples: 17853440 | consumed tokens: 36563845120 | elapsed time per iteration (s): 4.19 | learning rate: 3.905E-05 | global batch size: 512 | lm loss: 1.931360E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.281 | TFLOPs: 56.99 | 7: iteration 34880/ 44073 | consumed samples: 17858560 | consumed tokens: 36574330880 | elapsed time per iteration (s): 4.15 | learning rate: 3.901E-05 | global batch size: 512 | lm loss: 1.924533E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.354 | TFLOPs: 57.49 | 7: iteration 34890/ 44073 | consumed samples: 17863680 | consumed tokens: 36584816640 | elapsed time per iteration (s): 4.14 | learning rate: 3.897E-05 | global batch size: 512 | lm loss: 1.943090E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.584 | TFLOPs: 57.60 | 7: iteration 34900/ 44073 | consumed samples: 17868800 | consumed tokens: 36595302400 | elapsed time per iteration (s): 
4.16 | learning rate: 3.893E-05 | global batch size: 512 | lm loss: 1.941373E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.132 | TFLOPs: 57.39 | 7: iteration 34910/ 44073 | consumed samples: 17873920 | consumed tokens: 36605788160 | elapsed time per iteration (s): 4.17 | learning rate: 3.889E-05 | global batch size: 512 | lm loss: 1.913263E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.834 | TFLOPs: 57.25 | 7: iteration 34920/ 44073 | consumed samples: 17879040 | consumed tokens: 36616273920 | elapsed time per iteration (s): 4.18 | learning rate: 3.885E-05 | global batch size: 512 | lm loss: 1.928404E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.489 | TFLOPs: 57.09 | 7: iteration 34930/ 44073 | consumed samples: 17884160 | consumed tokens: 36626759680 | elapsed time per iteration (s): 4.20 | learning rate: 3.881E-05 | global batch size: 512 | lm loss: 1.941351E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.004 | TFLOPs: 56.86 | 7: iteration 34940/ 44073 | consumed samples: 17889280 | consumed tokens: 36637245440 | elapsed time per iteration (s): 4.19 | learning rate: 3.877E-05 | global batch size: 512 | lm loss: 1.932486E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.070 | TFLOPs: 56.89 | 7: iteration 34950/ 44073 | consumed samples: 17894400 | consumed tokens: 36647731200 | elapsed time per iteration (s): 4.16 | learning rate: 3.873E-05 | global batch size: 512 | lm loss: 1.939268E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.125 | TFLOPs: 57.38 | 7: iteration 34960/ 44073 
| consumed samples: 17899520 | consumed tokens: 36658216960 | elapsed time per iteration (s): 4.15 | learning rate: 3.869E-05 | global batch size: 512 | lm loss: 1.954239E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.433 | TFLOPs: 57.53 | 7: iteration 34970/ 44073 | consumed samples: 17904640 | consumed tokens: 36668702720 | elapsed time per iteration (s): 4.35 | learning rate: 3.865E-05 | global batch size: 512 | lm loss: 1.950237E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.607 | TFLOPs: 54.81 | 7: iteration 34980/ 44073 | consumed samples: 17909760 | consumed tokens: 36679188480 | elapsed time per iteration (s): 4.13 | learning rate: 3.861E-05 | global batch size: 512 | lm loss: 1.930304E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.846 | TFLOPs: 57.72 | 7: iteration 34990/ 44073 | consumed samples: 17914880 | consumed tokens: 36689674240 | elapsed time per iteration (s): 4.16 | learning rate: 3.857E-05 | global batch size: 512 | lm loss: 1.948627E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.187 | TFLOPs: 57.41 | 7: iteration 35000/ 44073 | consumed samples: 17920000 | consumed tokens: 36700160000 | elapsed time per iteration (s): 4.20 | learning rate: 3.853E-05 | global batch size: 512 | lm loss: 1.941541E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.798 | TFLOPs: 56.76 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 35000 | lm loss value: 1.913077E+00 | lm loss PPL: 6.773899E+00 | 7: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 35000 to checkpoints_2b2 0: [2022-11-27 03:23:47,852] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step35000 is begin to save! 0: [2022-11-27 03:23:47,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:23:48,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:23:48,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:23:48,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:23:48,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:23:48,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:23:48,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:23:48,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:23:48,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:23:48,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:23:48,720] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 03:23:48,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:23:48,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:23:48,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:23:48,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:23:49,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:23:49,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:23:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:23:49,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:23:49,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:23:49,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:23:49,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:23:49,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 03:23:49,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:23:49,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:23:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:23:49,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:23:49,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:23:49,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:23:50,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:23:50,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:23:50,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:23:50,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:23:50,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:23:50,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 03:23:50,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:23:50,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:23:50,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:23:50,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:23:50,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:23:50,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:23:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:23:50,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:23:50,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:23:50,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:23:50,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:23:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 03:23:51,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:23:51,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:23:51,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:23:51,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:23:51,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:23:51,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:23:51,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:23:51,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_29-model_00-model_states.pt... 0: [2022-11-27 03:23:51,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_29-model_00-model_states.pt. 0: [2022-11-27 03:23:51,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:23:51,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:23:51,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_31-model_00-model_states.pt... 
0: [2022-11-27 03:23:51,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_31-model_00-model_states.pt. 0: [2022-11-27 03:23:51,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_32-model_00-model_states.pt... 0: [2022-11-27 03:23:51,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_32-model_00-model_states.pt. 0: [2022-11-27 03:23:51,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_33-model_00-model_states.pt... 0: [2022-11-27 03:23:52,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_33-model_00-model_states.pt. 0: [2022-11-27 03:23:52,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_34-model_00-model_states.pt... 0: [2022-11-27 03:23:52,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_34-model_00-model_states.pt. 0: [2022-11-27 03:23:52,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/layer_36-model_00-model_states.pt... 0: [2022-11-27 03:23:52,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/layer_36-model_00-model_states.pt. 0: [2022-11-27 03:23:52,246] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step35000/mp_rank_00_model_states.pt 0: [2022-11-27 03:23:52,246] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:23:52,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/mp_rank_00_model_states.pt. 
0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:23:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:23:52,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:23:52,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:23:52,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:23:52,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:23:52,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:23:52,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:23:52,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:23:52,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:52,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:23:52,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:23:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:23:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:23:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-27 03:23:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
4: [2022-11-27 03:23:52,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:23:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:23:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:23:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:23:52,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-27 03:23:52,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:23:52,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:23:52,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:23:52,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
5: [2022-11-27 03:23:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-27 03:23:52,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:23:52,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:23:52,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:23:53,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:23:53,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-27 03:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
0: [2022-11-27 03:23:53,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:23:53,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-27 03:23:53,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:23:53,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 03:23:53,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
6: [2022-11-27 03:23:53,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
0: successfully saved checkpoint at iteration 35000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 5734.69
7: iteration 35010/ 44073 | consumed samples: 17925120 | consumed tokens: 36710645760 | elapsed time per iteration (s): 4.88 | learning rate: 3.849E-05 | global batch size: 512 | lm loss: 1.930188E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.946 | TFLOPs: 48.91 |
7: iteration 35020/ 44073 | consumed samples: 17930240 | consumed tokens: 36721131520 | elapsed time per iteration (s): 4.18 | learning rate: 3.845E-05 | global batch size: 512 | lm loss: 1.943791E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.558 | TFLOPs: 57.12 |
7: iteration 35030/ 44073 | consumed samples: 17935360 | consumed tokens: 36731617280 | elapsed time per iteration (s): 4.14 | learning rate: 3.841E-05 | global batch size: 512 | lm loss: 1.906327E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.717 | TFLOPs: 57.66 |
7: iteration 35040/ 44073 | consumed samples: 17940480 | consumed tokens: 36742103040 | elapsed time per iteration (s): 4.34 | learning rate: 3.838E-05 | global batch size: 512 | lm loss: 1.955433E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.981 | TFLOPs: 54.99 |
7: iteration 35050/ 44073 | consumed samples: 17945600 | consumed tokens: 36752588800 | elapsed time per iteration (s): 4.16 | learning rate: 3.834E-05 | global batch size: 512 | lm loss: 1.935234E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.053 | TFLOPs: 57.35 |
7: iteration 35060/ 44073 | consumed samples: 17950720 | consumed tokens: 36763074560 | elapsed time per iteration (s): 4.16 | learning rate: 3.830E-05 | global batch size: 512 | lm loss: 1.924830E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.202 | TFLOPs: 57.42 |
7: iteration 35070/ 44073 | consumed samples: 17955840 | consumed tokens: 36773560320 | elapsed time per iteration (s): 4.16 | learning rate: 3.826E-05 | global batch size: 512 | lm loss: 1.952092E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.931 | TFLOPs: 57.29 |
7: iteration 35080/ 44073 | consumed samples: 17960960 | consumed tokens: 36784046080 | elapsed time per iteration (s): 4.21 | learning rate: 3.822E-05 | global batch size: 512 | lm loss: 1.950365E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.502 | TFLOPs: 56.63 |
7: iteration 35090/ 44073 | consumed samples: 17966080 | consumed tokens: 36794531840 | elapsed time per iteration (s): 4.14 | learning rate: 3.818E-05 | global batch size: 512 | lm loss: 1.949591E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.648 | TFLOPs: 57.63 |
7: iteration 35100/ 44073 | consumed samples: 17971200 | consumed tokens: 36805017600 | elapsed time per iteration (s): 4.15 | learning rate: 3.814E-05 | global batch size: 512 | lm loss: 1.941685E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.443 | TFLOPs: 57.53 |
7: iteration 35110/ 44073 | consumed samples: 17976320 | consumed tokens: 36815503360 | elapsed time per iteration (s): 4.18 | learning rate: 3.810E-05 | global batch size: 512 | lm loss: 1.911491E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.502 | TFLOPs: 57.09 |
7: iteration 35120/ 44073 | consumed samples: 17981440 | consumed tokens: 36825989120 | elapsed time per iteration (s): 4.24 | learning rate: 3.806E-05 | global batch size: 512 | lm loss: 1.928813E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.851 | TFLOPs: 56.32 |
7: iteration 35130/ 44073 | consumed samples: 17986560 | consumed tokens: 36836474880 | elapsed time per iteration (s): 4.14 | learning rate: 3.802E-05 | global batch size: 512 | lm loss: 1.942406E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.594 | TFLOPs: 57.60 |
7: iteration 35140/ 44073 | consumed samples: 17991680 | consumed tokens: 36846960640 | elapsed time per iteration (s): 4.13 | learning rate: 3.799E-05 | global batch size: 512 | lm loss: 1.948173E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.902 | TFLOPs: 57.74 |
7: iteration 35150/ 44073 | consumed samples: 17996800 | consumed tokens: 36857446400 | elapsed time per iteration (s): 4.22 | learning rate: 3.795E-05 | global batch size: 512 | lm loss: 1.940827E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.278 | TFLOPs: 56.52 |
7: iteration 35160/ 44073 | consumed samples: 18001920 | consumed tokens: 36867932160 | elapsed time per iteration (s): 4.16 | learning rate: 3.791E-05 | global batch size: 512 | lm loss: 1.947966E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.130 | TFLOPs: 57.38 |
7: iteration 35170/ 44073 | consumed samples: 18007040 | consumed tokens: 36878417920 | elapsed time per iteration (s): 4.17 | learning rate: 3.787E-05 | global batch size: 512 | lm loss: 1.938643E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.662 | TFLOPs: 57.17 |
7: iteration 35180/ 44073 | consumed samples: 18012160 | consumed tokens: 36888903680 | elapsed time per iteration (s): 4.18 | learning rate: 3.783E-05 | global batch size: 512 | lm loss: 1.951622E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.549 | TFLOPs: 57.11 |
7: iteration 35190/ 44073 | consumed samples: 18017280 | consumed tokens: 36899389440 | elapsed time per iteration (s): 4.14 | learning rate: 3.779E-05 | global batch size: 512 | lm loss: 1.928552E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.703 | TFLOPs: 57.65 |
7: iteration 35200/ 44073 | consumed samples: 18022400 | consumed tokens: 36909875200 | elapsed time per iteration (s): 4.17 | learning rate: 3.775E-05 | global batch size: 512 | lm loss: 1.952557E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.830 | TFLOPs: 57.24 |
7: iteration 35210/ 44073 | consumed samples: 18027520 | consumed tokens: 36920360960 | elapsed time per iteration (s): 4.14 | learning rate: 3.771E-05 | global batch size: 512 | lm loss: 1.949237E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.771 | TFLOPs: 57.68 |
7: iteration 35220/ 44073 | consumed samples: 18032640 | consumed tokens: 36930846720 | elapsed time per iteration (s): 4.13 | learning rate: 3.768E-05 | global batch size: 512 | lm loss: 1.951572E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.868 | TFLOPs: 57.73 |
7: iteration 35230/ 44073 | consumed samples: 18037760 | consumed tokens: 36941332480 | elapsed time per iteration (s): 4.17 | learning rate: 3.764E-05 | global batch size: 512 | lm loss: 1.918863E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.714 | TFLOPs: 57.19 |
7: iteration 35240/ 44073 | consumed samples: 18042880 | consumed tokens: 36951818240 | elapsed time per iteration (s): 4.15 | learning rate: 3.760E-05 | global batch size: 512 | lm loss: 1.930472E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.384 | TFLOPs: 57.50 |
7: iteration 35250/ 44073 | consumed samples: 18048000 | consumed tokens: 36962304000 | elapsed time per iteration (s): 4.13 | learning rate: 3.756E-05 | global batch size: 512 | lm loss: 1.942329E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.930 | TFLOPs: 57.76 |
7: iteration 35260/ 44073 | consumed samples: 18053120 | consumed tokens: 36972789760 | elapsed time per iteration (s): 4.15 | learning rate: 3.752E-05 | global batch size: 512 | lm loss: 1.927570E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.261 | TFLOPs: 57.45 |
7: iteration 35270/ 44073 | consumed samples: 18058240 | consumed tokens: 36983275520 | elapsed time per iteration (s): 4.23 | learning rate: 3.748E-05 | global batch size: 512 | lm loss: 1.937265E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.121 | TFLOPs: 56.45 |
7: iteration 35280/ 44073 | consumed samples: 18063360 | consumed tokens: 36993761280 | elapsed time per iteration (s): 4.18 | learning rate: 3.744E-05 | global batch size: 512 | lm loss: 1.925469E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.520 | TFLOPs: 57.10 |
7: iteration 35290/ 44073 | consumed samples: 18068480 | consumed tokens: 37004247040 | elapsed time per iteration (s): 4.19 | learning rate: 3.741E-05 | global batch size: 512 | lm loss: 1.941691E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.096 | TFLOPs: 56.90 |
7: iteration 35300/ 44073 | consumed samples: 18073600 | consumed tokens: 37014732800 | elapsed time per iteration (s): 4.17 | learning rate: 3.737E-05 | global batch size: 512 | lm loss: 1.937210E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.912 | TFLOPs: 57.28 |
7: iteration 35310/ 44073 | consumed samples: 18078720 | consumed tokens: 37025218560 | elapsed time per iteration (s): 4.13 | learning rate: 3.733E-05 | global batch size: 512 | lm loss: 1.940637E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.845 | TFLOPs: 57.72 |
7: iteration 35320/ 44073 | consumed samples: 18083840 | consumed tokens: 37035704320 | elapsed time per iteration (s): 4.15 | learning rate: 3.729E-05 | global batch size: 512 | lm loss: 1.951010E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.469 | TFLOPs: 57.54 |
7: iteration 35330/ 44073 | consumed samples: 18088960 | consumed tokens: 37046190080 | elapsed time per iteration (s): 4.15 | learning rate: 3.725E-05 | global batch size: 512 | lm loss: 1.954319E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.360 | TFLOPs: 57.49 |
7: iteration 35340/ 44073 | consumed samples: 18094080 | consumed tokens: 37056675840 | elapsed time per iteration (s): 4.17 | learning rate: 3.722E-05 | global batch size: 512 | lm loss: 1.948800E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.670 | TFLOPs: 57.17 |
7: iteration 35350/ 44073 | consumed samples: 18099200 | consumed tokens: 37067161600 | elapsed time per iteration (s): 4.15 | learning rate: 3.718E-05 | global batch size: 512 | lm loss: 1.946124E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.397 | TFLOPs: 57.51 |
7: iteration 35360/ 44073 | consumed samples: 18104320 | consumed tokens: 37077647360 | elapsed time per iteration (s): 4.16 | learning rate: 3.714E-05 | global batch size: 512 | lm loss: 1.933248E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.037 | TFLOPs: 57.34 |
7: iteration 35370/ 44073 | consumed samples: 18109440 | consumed tokens: 37088133120 | elapsed time per iteration (s): 4.16 | learning rate: 3.710E-05 | global batch size: 512 | lm loss: 1.930897E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.972 | TFLOPs: 57.31 |
7: iteration 35380/ 44073 | consumed samples: 18114560 | consumed tokens: 37098618880 | elapsed time per iteration (s): 4.34 | learning rate: 3.706E-05 | global batch size: 512 | lm loss: 1.914689E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.931 | TFLOPs: 54.96 |
7: iteration 35390/ 44073 | consumed samples: 18119680 | consumed tokens: 37109104640 | elapsed time per iteration (s): 
4.16 | learning rate: 3.703E-05 | global batch size: 512 | lm loss: 1.936893E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.946 | TFLOPs: 57.30 | 7: iteration 35400/ 44073 | consumed samples: 18124800 | consumed tokens: 37119590400 | elapsed time per iteration (s): 4.17 | learning rate: 3.699E-05 | global batch size: 512 | lm loss: 1.934353E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.923 | TFLOPs: 57.29 | 7: iteration 35410/ 44073 | consumed samples: 18129920 | consumed tokens: 37130076160 | elapsed time per iteration (s): 4.15 | learning rate: 3.695E-05 | global batch size: 512 | lm loss: 1.963445E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.519 | TFLOPs: 57.57 | 7: iteration 35420/ 44073 | consumed samples: 18135040 | consumed tokens: 37140561920 | elapsed time per iteration (s): 4.17 | learning rate: 3.691E-05 | global batch size: 512 | lm loss: 1.943916E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.655 | TFLOPs: 57.16 | 7: iteration 35430/ 44073 | consumed samples: 18140160 | consumed tokens: 37151047680 | elapsed time per iteration (s): 4.18 | learning rate: 3.687E-05 | global batch size: 512 | lm loss: 1.942990E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.480 | TFLOPs: 57.08 | 7: iteration 35440/ 44073 | consumed samples: 18145280 | consumed tokens: 37161533440 | elapsed time per iteration (s): 4.17 | learning rate: 3.684E-05 | global batch size: 512 | lm loss: 1.936634E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.764 | TFLOPs: 57.21 | 7: iteration 35450/ 44073 
| consumed samples: 18150400 | consumed tokens: 37172019200 | elapsed time per iteration (s): 4.18 | learning rate: 3.680E-05 | global batch size: 512 | lm loss: 1.945138E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.589 | TFLOPs: 57.13 | 7: iteration 35460/ 44073 | consumed samples: 18155520 | consumed tokens: 37182504960 | elapsed time per iteration (s): 4.18 | learning rate: 3.676E-05 | global batch size: 512 | lm loss: 1.937208E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.514 | TFLOPs: 57.10 | 7: iteration 35470/ 44073 | consumed samples: 18160640 | consumed tokens: 37192990720 | elapsed time per iteration (s): 4.20 | learning rate: 3.672E-05 | global batch size: 512 | lm loss: 1.932969E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.816 | TFLOPs: 56.77 | 7: iteration 35480/ 44073 | consumed samples: 18165760 | consumed tokens: 37203476480 | elapsed time per iteration (s): 4.22 | learning rate: 3.669E-05 | global batch size: 512 | lm loss: 1.949719E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.194 | TFLOPs: 56.48 | 7: iteration 35490/ 44073 | consumed samples: 18170880 | consumed tokens: 37213962240 | elapsed time per iteration (s): 4.14 | learning rate: 3.665E-05 | global batch size: 512 | lm loss: 1.950513E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.696 | TFLOPs: 57.65 | 7: iteration 35500/ 44073 | consumed samples: 18176000 | consumed tokens: 37224448000 | elapsed time per iteration (s): 4.14 | learning rate: 3.661E-05 | global batch size: 512 | lm loss: 1.931764E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 123.621 | TFLOPs: 57.61 | 7: iteration 35510/ 44073 | consumed samples: 18181120 | consumed tokens: 37234933760 | elapsed time per iteration (s): 4.14 | learning rate: 3.657E-05 | global batch size: 512 | lm loss: 1.928488E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.611 | TFLOPs: 57.61 | 7: iteration 35520/ 44073 | consumed samples: 18186240 | consumed tokens: 37245419520 | elapsed time per iteration (s): 4.14 | learning rate: 3.654E-05 | global batch size: 512 | lm loss: 1.950311E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.740 | TFLOPs: 57.67 | 7: iteration 35530/ 44073 | consumed samples: 18191360 | consumed tokens: 37255905280 | elapsed time per iteration (s): 4.17 | learning rate: 3.650E-05 | global batch size: 512 | lm loss: 1.918340E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.878 | TFLOPs: 57.27 | 7: iteration 35540/ 44073 | consumed samples: 18196480 | consumed tokens: 37266391040 | elapsed time per iteration (s): 4.16 | learning rate: 3.646E-05 | global batch size: 512 | lm loss: 1.947036E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.183 | TFLOPs: 57.41 | 7: iteration 35550/ 44073 | consumed samples: 18201600 | consumed tokens: 37276876800 | elapsed time per iteration (s): 4.16 | learning rate: 3.642E-05 | global batch size: 512 | lm loss: 1.936217E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.954 | TFLOPs: 57.30 | 7: iteration 35560/ 44073 | consumed samples: 18206720 | consumed tokens: 37287362560 | elapsed time per iteration (s): 4.17 | learning rate: 3.639E-05 | global batch size: 512 | lm 
loss: 1.939201E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.700 | TFLOPs: 57.18 | 7: iteration 35570/ 44073 | consumed samples: 18211840 | consumed tokens: 37297848320 | elapsed time per iteration (s): 4.17 | learning rate: 3.635E-05 | global batch size: 512 | lm loss: 1.924936E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.756 | TFLOPs: 57.21 | 7: iteration 35580/ 44073 | consumed samples: 18216960 | consumed tokens: 37308334080 | elapsed time per iteration (s): 4.17 | learning rate: 3.631E-05 | global batch size: 512 | lm loss: 1.943511E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.792 | TFLOPs: 57.23 | 7: iteration 35590/ 44073 | consumed samples: 18222080 | consumed tokens: 37318819840 | elapsed time per iteration (s): 4.17 | learning rate: 3.627E-05 | global batch size: 512 | lm loss: 1.948987E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.876 | TFLOPs: 57.27 | 7: iteration 35600/ 44073 | consumed samples: 18227200 | consumed tokens: 37329305600 | elapsed time per iteration (s): 4.21 | learning rate: 3.624E-05 | global batch size: 512 | lm loss: 1.940532E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.584 | TFLOPs: 56.66 | 7: iteration 35610/ 44073 | consumed samples: 18232320 | consumed tokens: 37339791360 | elapsed time per iteration (s): 4.18 | learning rate: 3.620E-05 | global batch size: 512 | lm loss: 1.936924E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.578 | TFLOPs: 57.13 | 7: iteration 35620/ 44073 | consumed samples: 18237440 | consumed tokens: 37350277120 | 
elapsed time per iteration (s): 4.17 | learning rate: 3.616E-05 | global batch size: 512 | lm loss: 1.942775E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.810 | TFLOPs: 57.24 | 7: iteration 35630/ 44073 | consumed samples: 18242560 | consumed tokens: 37360762880 | elapsed time per iteration (s): 4.13 | learning rate: 3.613E-05 | global batch size: 512 | lm loss: 1.936376E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.862 | TFLOPs: 57.73 | 7: iteration 35640/ 44073 | consumed samples: 18247680 | consumed tokens: 37371248640 | elapsed time per iteration (s): 4.16 | learning rate: 3.609E-05 | global batch size: 512 | lm loss: 1.917518E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.086 | TFLOPs: 57.36 | 7: iteration 35650/ 44073 | consumed samples: 18252800 | consumed tokens: 37381734400 | elapsed time per iteration (s): 4.19 | learning rate: 3.605E-05 | global batch size: 512 | lm loss: 1.941943E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.295 | TFLOPs: 57.00 | 7: iteration 35660/ 44073 | consumed samples: 18257920 | consumed tokens: 37392220160 | elapsed time per iteration (s): 4.15 | learning rate: 3.602E-05 | global batch size: 512 | lm loss: 1.939841E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.256 | TFLOPs: 57.44 | 7: iteration 35670/ 44073 | consumed samples: 18263040 | consumed tokens: 37402705920 | elapsed time per iteration (s): 4.21 | learning rate: 3.598E-05 | global batch size: 512 | lm loss: 1.947798E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.740 | TFLOPs: 
56.74 | 7: iteration 35680/ 44073 | consumed samples: 18268160 | consumed tokens: 37413191680 | elapsed time per iteration (s): 4.23 | learning rate: 3.594E-05 | global batch size: 512 | lm loss: 1.934198E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.044 | TFLOPs: 56.41 | 7: iteration 35690/ 44073 | consumed samples: 18273280 | consumed tokens: 37423677440 | elapsed time per iteration (s): 4.18 | learning rate: 3.590E-05 | global batch size: 512 | lm loss: 1.929355E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.477 | TFLOPs: 57.08 | 7: iteration 35700/ 44073 | consumed samples: 18278400 | consumed tokens: 37434163200 | elapsed time per iteration (s): 4.16 | learning rate: 3.587E-05 | global batch size: 512 | lm loss: 1.937993E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.119 | TFLOPs: 57.38 | 7: iteration 35710/ 44073 | consumed samples: 18283520 | consumed tokens: 37444648960 | elapsed time per iteration (s): 4.16 | learning rate: 3.583E-05 | global batch size: 512 | lm loss: 1.917383E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.224 | TFLOPs: 57.43 | 7: iteration 35720/ 44073 | consumed samples: 18288640 | consumed tokens: 37455134720 | elapsed time per iteration (s): 4.16 | learning rate: 3.579E-05 | global batch size: 512 | lm loss: 1.927808E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.037 | TFLOPs: 57.34 | 7: iteration 35730/ 44073 | consumed samples: 18293760 | consumed tokens: 37465620480 | elapsed time per iteration (s): 4.18 | learning rate: 3.576E-05 | global batch size: 512 | lm loss: 1.927981E+00 | grad norm: 0.128 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.469 | TFLOPs: 57.08 | 7: iteration 35740/ 44073 | consumed samples: 18298880 | consumed tokens: 37476106240 | elapsed time per iteration (s): 4.35 | learning rate: 3.572E-05 | global batch size: 512 | lm loss: 1.941507E+00 | grad norm: 0.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.793 | TFLOPs: 54.90 | 7: iteration 35750/ 44073 | consumed samples: 18304000 | consumed tokens: 37486592000 | elapsed time per iteration (s): 4.19 | learning rate: 3.568E-05 | global batch size: 512 | lm loss: 1.941414E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.053 | TFLOPs: 56.88 | 7: iteration 35760/ 44073 | consumed samples: 18309120 | consumed tokens: 37497077760 | elapsed time per iteration (s): 4.18 | learning rate: 3.565E-05 | global batch size: 512 | lm loss: 1.916953E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.556 | TFLOPs: 57.12 | 7: iteration 35770/ 44073 | consumed samples: 18314240 | consumed tokens: 37507563520 | elapsed time per iteration (s): 4.17 | learning rate: 3.561E-05 | global batch size: 512 | lm loss: 1.925444E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.804 | TFLOPs: 57.23 | 7: iteration 35780/ 44073 | consumed samples: 18319360 | consumed tokens: 37518049280 | elapsed time per iteration (s): 4.16 | learning rate: 3.558E-05 | global batch size: 512 | lm loss: 1.928560E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.026 | TFLOPs: 57.34 | 7: iteration 35790/ 44073 | consumed samples: 18324480 | consumed tokens: 37528535040 | elapsed time per iteration (s): 4.16 | learning rate: 3.554E-05 
| global batch size: 512 | lm loss: 1.940209E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.175 | TFLOPs: 57.41 | 7: iteration 35800/ 44073 | consumed samples: 18329600 | consumed tokens: 37539020800 | elapsed time per iteration (s): 4.17 | learning rate: 3.550E-05 | global batch size: 512 | lm loss: 1.942242E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.895 | TFLOPs: 57.28 | 7: iteration 35810/ 44073 | consumed samples: 18334720 | consumed tokens: 37549506560 | elapsed time per iteration (s): 4.19 | learning rate: 3.547E-05 | global batch size: 512 | lm loss: 1.924557E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.300 | TFLOPs: 57.00 | 7: iteration 35820/ 44073 | consumed samples: 18339840 | consumed tokens: 37559992320 | elapsed time per iteration (s): 4.20 | learning rate: 3.543E-05 | global batch size: 512 | lm loss: 1.953562E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.827 | TFLOPs: 56.78 | 7: iteration 35830/ 44073 | consumed samples: 18344960 | consumed tokens: 37570478080 | elapsed time per iteration (s): 4.16 | learning rate: 3.539E-05 | global batch size: 512 | lm loss: 1.950698E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.999 | TFLOPs: 57.32 | 7: iteration 35840/ 44073 | consumed samples: 18350080 | consumed tokens: 37580963840 | elapsed time per iteration (s): 4.20 | learning rate: 3.536E-05 | global batch size: 512 | lm loss: 1.942081E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.011 | TFLOPs: 56.86 | 7: iteration 35850/ 44073 | consumed samples: 18355200 | 
consumed tokens: 37591449600 | elapsed time per iteration (s): 4.18 | learning rate: 3.532E-05 | global batch size: 512 | lm loss: 1.931203E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.537 | TFLOPs: 57.11 | 7: iteration 35860/ 44073 | consumed samples: 18360320 | consumed tokens: 37601935360 | elapsed time per iteration (s): 4.17 | learning rate: 3.528E-05 | global batch size: 512 | lm loss: 1.943206E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.800 | TFLOPs: 57.23 | 7: iteration 35870/ 44073 | consumed samples: 18365440 | consumed tokens: 37612421120 | elapsed time per iteration (s): 4.17 | learning rate: 3.525E-05 | global batch size: 512 | lm loss: 1.944509E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.900 | TFLOPs: 57.28 | 7: iteration 35880/ 44073 | consumed samples: 18370560 | consumed tokens: 37622906880 | elapsed time per iteration (s): 4.15 | learning rate: 3.521E-05 | global batch size: 512 | lm loss: 1.916926E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.510 | TFLOPs: 57.56 | 7: iteration 35890/ 44073 | consumed samples: 18375680 | consumed tokens: 37633392640 | elapsed time per iteration (s): 4.18 | learning rate: 3.518E-05 | global batch size: 512 | lm loss: 1.949958E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.412 | TFLOPs: 57.05 | 7: iteration 35900/ 44073 | consumed samples: 18380800 | consumed tokens: 37643878400 | elapsed time per iteration (s): 4.20 | learning rate: 3.514E-05 | global batch size: 512 | lm loss: 1.924391E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 121.847 | TFLOPs: 56.79 | 7: iteration 35910/ 44073 | consumed samples: 18385920 | consumed tokens: 37654364160 | elapsed time per iteration (s): 4.37 | learning rate: 3.510E-05 | global batch size: 512 | lm loss: 1.912865E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.150 | TFLOPs: 54.60 | 7: iteration 35920/ 44073 | consumed samples: 18391040 | consumed tokens: 37664849920 | elapsed time per iteration (s): 4.15 | learning rate: 3.507E-05 | global batch size: 512 | lm loss: 1.940616E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.293 | TFLOPs: 57.46 | 7: iteration 35930/ 44073 | consumed samples: 18396160 | consumed tokens: 37675335680 | elapsed time per iteration (s): 4.16 | learning rate: 3.503E-05 | global batch size: 512 | lm loss: 1.943403E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.216 | TFLOPs: 57.42 | 7: iteration 35940/ 44073 | consumed samples: 18401280 | consumed tokens: 37685821440 | elapsed time per iteration (s): 4.17 | learning rate: 3.500E-05 | global batch size: 512 | lm loss: 1.936304E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.658 | TFLOPs: 57.17 | 7: iteration 35950/ 44073 | consumed samples: 18406400 | consumed tokens: 37696307200 | elapsed time per iteration (s): 4.24 | learning rate: 3.496E-05 | global batch size: 512 | lm loss: 1.933818E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.677 | TFLOPs: 56.24 | 7: iteration 35960/ 44073 | consumed samples: 18411520 | consumed tokens: 37706792960 | elapsed time per iteration (s): 4.21 | learning rate: 3.493E-05 | global batch size: 512 | lm loss: 1.957655E+00 | grad norm: 
0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.749 | TFLOPs: 56.74 | 7: iteration 35970/ 44073 | consumed samples: 18416640 | consumed tokens: 37717278720 | elapsed time per iteration (s): 4.33 | learning rate: 3.489E-05 | global batch size: 512 | lm loss: 1.931774E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.213 | TFLOPs: 55.09 | 7: iteration 35980/ 44073 | consumed samples: 18421760 | consumed tokens: 37727764480 | elapsed time per iteration (s): 4.17 | learning rate: 3.485E-05 | global batch size: 512 | lm loss: 1.941188E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.876 | TFLOPs: 57.27 | 7: iteration 35990/ 44073 | consumed samples: 18426880 | consumed tokens: 37738250240 | elapsed time per iteration (s): 4.14 | learning rate: 3.482E-05 | global batch size: 512 | lm loss: 1.960456E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.544 | TFLOPs: 57.58 | 0: [2022-11-27 04:33:31,812] [INFO] [logging.py:68:log_dist] [Rank 0] step=36000, skipped=0, lr=[3.4782801918957975e-05, 3.4782801918957975e-05, 3.4782801918957975e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: steps: 36000 loss: 1.9166 iter time (s): 4.181 samples/sec: 122.470 7: iteration 36000/ 44073 | consumed samples: 18432000 | consumed tokens: 37748736000 | elapsed time per iteration (s): 4.14 | learning rate: 3.478E-05 | global batch size: 512 | lm loss: 1.928480E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.749 | TFLOPs: 57.67 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 36000 | lm loss value: 1.893684E+00 | lm loss PPL: 
6.643802E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 36000 to checkpoints_2b2 0: [2022-11-27 04:33:33,177] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step36000 is begin to save! 0: [2022-11-27 04:33:33,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:33:33,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:33:33,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:33:33,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:33:33,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:33:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:33:33,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:33:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:33:33,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:33:34,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_06-model_00-model_states.pt. 
0: [2022-11-27 04:33:34,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:33:34,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:33:34,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:33:34,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:33:34,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:33:34,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:33:34,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:33:34,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:33:34,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:33:34,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:33:34,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:33:34,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_12-model_00-model_states.pt. 
0: [2022-11-27 04:33:34,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:33:35,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:33:35,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:33:35,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:33:35,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:33:35,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:33:35,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:33:35,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:33:35,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:33:35,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:33:35,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:33:35,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_18-model_00-model_states.pt. 
0: [2022-11-27 04:33:35,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:33:35,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:33:35,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:33:36,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:33:36,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:33:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:33:36,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:33:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:33:36,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:33:36,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:33:36,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:33:36,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_24-model_00-model_states.pt. 
0: [2022-11-27 04:33:36,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:33:36,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:33:36,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:33:36,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:33:36,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:33:37,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:33:37,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:33:37,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:33:37,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_29-model_00-model_states.pt... 0: [2022-11-27 04:33:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_29-model_00-model_states.pt. 0: [2022-11-27 04:33:37,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:33:37,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 04:33:37,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_31-model_00-model_states.pt... 0: [2022-11-27 04:33:37,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_31-model_00-model_states.pt. 0: [2022-11-27 04:33:37,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_32-model_00-model_states.pt... 0: [2022-11-27 04:33:37,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_32-model_00-model_states.pt. 0: [2022-11-27 04:33:37,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_33-model_00-model_states.pt... 0: [2022-11-27 04:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_33-model_00-model_states.pt. 0: [2022-11-27 04:33:37,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_34-model_00-model_states.pt... 0: [2022-11-27 04:33:37,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_34-model_00-model_states.pt. 0: [2022-11-27 04:33:37,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/layer_36-model_00-model_states.pt... 0: [2022-11-27 04:33:37,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/layer_36-model_00-model_states.pt. 0: [2022-11-27 04:33:37,999] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step36000/mp_rank_00_model_states.pt 0: [2022-11-27 04:33:37,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/mp_rank_00_model_states.pt... 
0: [2022-11-27 04:33:38,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:33:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:33:38,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:33:38,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:33:38,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:33:38,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:33:38,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:33:38,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:33:38,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:33:38,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-27 04:33:38,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:33:38,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:38,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
4: [2022-11-27 04:33:38,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:33:38,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
3: [2022-11-27 04:33:38,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-27 04:33:38,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:33:38,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:33:38,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-27 04:33:38,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:33:38,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:33:38,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:33:39,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-27 04:33:39,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:33:39,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:33:39,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
0: [2022-11-27 04:33:39,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:33:39,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-27 04:33:39,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:33:39,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:33:39,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:33:39,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:33:39,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-27 04:33:39,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:33:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:33:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:33:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:33:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-27 04:33:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-27 04:33:39,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
0: successfully saved checkpoint at iteration 36000 to checkpoints_2b2 7: time (ms) | save-checkpoint: 6199.47 7: iteration 36010/ 44073 | consumed samples: 18437120 | consumed tokens: 37759221760 | elapsed time per iteration (s): 4.90 | learning rate: 3.475E-05 | global batch size: 512 | lm loss: 1.939615E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.417 | TFLOPs: 48.66 | 7: iteration 36020/ 44073 | consumed samples: 18442240 | consumed tokens: 37769707520 | elapsed time per iteration (s): 4.16 | learning rate: 3.471E-05 | global batch size: 512 | lm loss: 1.941752E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.168 | TFLOPs: 57.40 | 7: iteration 36030/ 44073 | consumed samples: 18447360 | consumed tokens: 37780193280 | elapsed time per iteration (s): 4.16 | learning rate: 3.468E-05 | global batch size: 512 | lm loss: 1.926506E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.136 | TFLOPs: 57.39 | 7: iteration 36040/ 44073 | consumed samples: 18452480 | consumed tokens: 37790679040 | elapsed time per iteration (s): 4.17 | learning rate: 3.464E-05 | global batch size: 512 | lm loss: 1.937787E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.907 | TFLOPs: 57.28 | 7: iteration 36050/ 44073 | consumed samples: 18457600 | consumed tokens: 37801164800 | elapsed time per iteration (s): 4.21 | learning rate: 3.461E-05 | global batch size: 512 | lm loss: 1.947786E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.478 | TFLOPs: 56.61 | 7: iteration 36060/ 44073 | consumed samples: 18462720 | consumed tokens: 37811650560 | elapsed time per iteration (s): 4.19 | learning rate: 
3.457E-05 | global batch size: 512 | lm loss: 1.919564E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.153 | TFLOPs: 56.93 | 7: iteration 36070/ 44073 | consumed samples: 18467840 | consumed tokens: 37822136320 | elapsed time per iteration (s): 4.20 | learning rate: 3.453E-05 | global batch size: 512 | lm loss: 1.928934E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.937 | TFLOPs: 56.83 | 7: iteration 36080/ 44073 | consumed samples: 18472960 | consumed tokens: 37832622080 | elapsed time per iteration (s): 4.17 | learning rate: 3.450E-05 | global batch size: 512 | lm loss: 1.927804E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.739 | TFLOPs: 57.20 | 7: iteration 36090/ 44073 | consumed samples: 18478080 | consumed tokens: 37843107840 | elapsed time per iteration (s): 4.20 | learning rate: 3.446E-05 | global batch size: 512 | lm loss: 1.937645E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.862 | TFLOPs: 56.79 | 7: iteration 36100/ 44073 | consumed samples: 18483200 | consumed tokens: 37853593600 | elapsed time per iteration (s): 4.17 | learning rate: 3.443E-05 | global batch size: 512 | lm loss: 1.923104E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.708 | TFLOPs: 57.19 | 7: iteration 36110/ 44073 | consumed samples: 18488320 | consumed tokens: 37864079360 | elapsed time per iteration (s): 4.16 | learning rate: 3.439E-05 | global batch size: 512 | lm loss: 1.916023E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.176 | TFLOPs: 57.41 | 7: iteration 36120/ 44073 | consumed samples: 
18493440 | consumed tokens: 37874565120 | elapsed time per iteration (s): 4.18 | learning rate: 3.436E-05 | global batch size: 512 | lm loss: 1.954817E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.370 | TFLOPs: 57.03 | 7: iteration 36130/ 44073 | consumed samples: 18498560 | consumed tokens: 37885050880 | elapsed time per iteration (s): 4.14 | learning rate: 3.432E-05 | global batch size: 512 | lm loss: 1.920597E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.757 | TFLOPs: 57.68 | 7: iteration 36140/ 44073 | consumed samples: 18503680 | consumed tokens: 37895536640 | elapsed time per iteration (s): 4.14 | learning rate: 3.429E-05 | global batch size: 512 | lm loss: 1.945937E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.623 | TFLOPs: 57.61 | 7: iteration 36150/ 44073 | consumed samples: 18508800 | consumed tokens: 37906022400 | elapsed time per iteration (s): 4.15 | learning rate: 3.425E-05 | global batch size: 512 | lm loss: 1.931060E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.399 | TFLOPs: 57.51 | 7: iteration 36160/ 44073 | consumed samples: 18513920 | consumed tokens: 37916508160 | elapsed time per iteration (s): 4.19 | learning rate: 3.422E-05 | global batch size: 512 | lm loss: 1.923149E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.306 | TFLOPs: 57.00 | 7: iteration 36170/ 44073 | consumed samples: 18519040 | consumed tokens: 37926993920 | elapsed time per iteration (s): 4.18 | learning rate: 3.418E-05 | global batch size: 512 | lm loss: 1.919990E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 122.591 | TFLOPs: 57.13 | 7: iteration 36180/ 44073 | consumed samples: 18524160 | consumed tokens: 37937479680 | elapsed time per iteration (s): 4.19 | learning rate: 3.415E-05 | global batch size: 512 | lm loss: 1.950924E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.171 | TFLOPs: 56.94 | 7: iteration 36190/ 44073 | consumed samples: 18529280 | consumed tokens: 37947965440 | elapsed time per iteration (s): 4.14 | learning rate: 3.411E-05 | global batch size: 512 | lm loss: 1.935352E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.632 | TFLOPs: 57.62 | 7: iteration 36200/ 44073 | consumed samples: 18534400 | consumed tokens: 37958451200 | elapsed time per iteration (s): 4.16 | learning rate: 3.408E-05 | global batch size: 512 | lm loss: 1.946268E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.117 | TFLOPs: 57.38 | 7: iteration 36210/ 44073 | consumed samples: 18539520 | consumed tokens: 37968936960 | elapsed time per iteration (s): 4.16 | learning rate: 3.404E-05 | global batch size: 512 | lm loss: 1.941556E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.008 | TFLOPs: 57.33 | 7: iteration 36220/ 44073 | consumed samples: 18544640 | consumed tokens: 37979422720 | elapsed time per iteration (s): 4.16 | learning rate: 3.401E-05 | global batch size: 512 | lm loss: 1.932014E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.969 | TFLOPs: 57.31 | 7: iteration 36230/ 44073 | consumed samples: 18549760 | consumed tokens: 37989908480 | elapsed time per iteration (s): 4.19 | learning rate: 3.397E-05 | global batch size: 512 | lm loss: 1.935672E+00 | 
grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.254 | TFLOPs: 56.98 | 7: iteration 36240/ 44073 | consumed samples: 18554880 | consumed tokens: 38000394240 | elapsed time per iteration (s): 4.18 | learning rate: 3.394E-05 | global batch size: 512 | lm loss: 1.942421E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.546 | TFLOPs: 57.11 | 7: iteration 36250/ 44073 | consumed samples: 18560000 | consumed tokens: 38010880000 | elapsed time per iteration (s): 4.17 | learning rate: 3.391E-05 | global batch size: 512 | lm loss: 1.913516E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.751 | TFLOPs: 57.21 | 7: iteration 36260/ 44073 | consumed samples: 18565120 | consumed tokens: 38021365760 | elapsed time per iteration (s): 4.17 | learning rate: 3.387E-05 | global batch size: 512 | lm loss: 1.922765E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.780 | TFLOPs: 57.22 | 7: iteration 36270/ 44073 | consumed samples: 18570240 | consumed tokens: 38031851520 | elapsed time per iteration (s): 4.16 | learning rate: 3.384E-05 | global batch size: 512 | lm loss: 1.938363E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.185 | TFLOPs: 57.41 | 7: iteration 36280/ 44073 | consumed samples: 18575360 | consumed tokens: 38042337280 | elapsed time per iteration (s): 4.16 | learning rate: 3.380E-05 | global batch size: 512 | lm loss: 1.927423E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.088 | TFLOPs: 57.37 | 7: iteration 36290/ 44073 | consumed samples: 18580480 | consumed tokens: 38052823040 | elapsed time per 
iteration (s): 4.15 | learning rate: 3.377E-05 | global batch size: 512 | lm loss: 1.941397E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.356 | TFLOPs: 57.49 | 7: iteration 36300/ 44073 | consumed samples: 18585600 | consumed tokens: 38063308800 | elapsed time per iteration (s): 4.16 | learning rate: 3.373E-05 | global batch size: 512 | lm loss: 1.931159E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.974 | TFLOPs: 57.31 | 7: iteration 36310/ 44073 | consumed samples: 18590720 | consumed tokens: 38073794560 | elapsed time per iteration (s): 4.36 | learning rate: 3.370E-05 | global batch size: 512 | lm loss: 1.924225E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.419 | TFLOPs: 54.72 | 7: iteration 36320/ 44073 | consumed samples: 18595840 | consumed tokens: 38084280320 | elapsed time per iteration (s): 4.16 | learning rate: 3.366E-05 | global batch size: 512 | lm loss: 1.931050E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.170 | TFLOPs: 57.40 | 7: iteration 36330/ 44073 | consumed samples: 18600960 | consumed tokens: 38094766080 | elapsed time per iteration (s): 4.17 | learning rate: 3.363E-05 | global batch size: 512 | lm loss: 1.929650E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.876 | TFLOPs: 57.27 | 7: iteration 36340/ 44073 | consumed samples: 18606080 | consumed tokens: 38105251840 | elapsed time per iteration (s): 4.18 | learning rate: 3.360E-05 | global batch size: 512 | lm loss: 1.933147E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.540 | TFLOPs: 57.11 | 7: 
iteration 36350/ 44073 | consumed samples: 18611200 | consumed tokens: 38115737600 | elapsed time per iteration (s): 4.20 | learning rate: 3.356E-05 | global batch size: 512 | lm loss: 1.931848E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.817 | TFLOPs: 56.77 | 7: iteration 36360/ 44073 | consumed samples: 18616320 | consumed tokens: 38126223360 | elapsed time per iteration (s): 4.19 | learning rate: 3.353E-05 | global batch size: 512 | lm loss: 1.920812E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.245 | TFLOPs: 56.97 | 7: iteration 36370/ 44073 | consumed samples: 18621440 | consumed tokens: 38136709120 | elapsed time per iteration (s): 4.18 | learning rate: 3.349E-05 | global batch size: 512 | lm loss: 1.962040E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.612 | TFLOPs: 57.14 | 7: iteration 36380/ 44073 | consumed samples: 18626560 | consumed tokens: 38147194880 | elapsed time per iteration (s): 4.14 | learning rate: 3.346E-05 | global batch size: 512 | lm loss: 1.935569E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.759 | TFLOPs: 57.68 | 7: iteration 36390/ 44073 | consumed samples: 18631680 | consumed tokens: 38157680640 | elapsed time per iteration (s): 4.13 | learning rate: 3.342E-05 | global batch size: 512 | lm loss: 1.940172E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.843 | TFLOPs: 57.72 | 7: iteration 36400/ 44073 | consumed samples: 18636800 | consumed tokens: 38168166400 | elapsed time per iteration (s): 4.14 | learning rate: 3.339E-05 | global batch size: 512 | lm loss: 1.935213E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 123.754 | TFLOPs: 57.68 | 7: iteration 36410/ 44073 | consumed samples: 18641920 | consumed tokens: 38178652160 | elapsed time per iteration (s): 4.16 | learning rate: 3.336E-05 | global batch size: 512 | lm loss: 1.947737E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.175 | TFLOPs: 57.41 | 7: iteration 36420/ 44073 | consumed samples: 18647040 | consumed tokens: 38189137920 | elapsed time per iteration (s): 4.14 | learning rate: 3.332E-05 | global batch size: 512 | lm loss: 1.932442E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.611 | TFLOPs: 57.61 | 7: iteration 36430/ 44073 | consumed samples: 18652160 | consumed tokens: 38199623680 | elapsed time per iteration (s): 4.15 | learning rate: 3.329E-05 | global batch size: 512 | lm loss: 1.942123E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.367 | TFLOPs: 57.50 | 7: iteration 36440/ 44073 | consumed samples: 18657280 | consumed tokens: 38210109440 | elapsed time per iteration (s): 4.20 | learning rate: 3.326E-05 | global batch size: 512 | lm loss: 1.936034E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.004 | TFLOPs: 56.86 | 7: iteration 36450/ 44073 | consumed samples: 18662400 | consumed tokens: 38220595200 | elapsed time per iteration (s): 4.25 | learning rate: 3.322E-05 | global batch size: 512 | lm loss: 1.924060E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.480 | TFLOPs: 56.15 | 7: iteration 36460/ 44073 | consumed samples: 18667520 | consumed tokens: 38231080960 | elapsed time per iteration (s): 4.24 | learning rate: 3.319E-05 | global 
batch size: 512 | lm loss: 1.921612E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 120.719 | TFLOPs: 56.26 | 7: iteration 36470/ 44073 | consumed samples: 18672640 | consumed tokens: 38241566720 | elapsed time per iteration (s): 4.23 | learning rate: 3.315E-05 | global batch size: 512 | lm loss: 1.913466E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.098 | TFLOPs: 56.44 | 7: iteration 36480/ 44073 | consumed samples: 18677760 | consumed tokens: 38252052480 | elapsed time per iteration (s): 4.14 | learning rate: 3.312E-05 | global batch size: 512 | lm loss: 1.950156E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.676 | TFLOPs: 57.64 | 7: iteration 36490/ 44073 | consumed samples: 18682880 | consumed tokens: 38262538240 | elapsed time per iteration (s): 4.15 | learning rate: 3.309E-05 | global batch size: 512 | lm loss: 1.933923E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.387 | TFLOPs: 57.50 | 7: iteration 36500/ 44073 | consumed samples: 18688000 | consumed tokens: 38273024000 | elapsed time per iteration (s): 4.16 | learning rate: 3.305E-05 | global batch size: 512 | lm loss: 1.947684E+00 | grad norm: 0.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.191 | TFLOPs: 57.41 | 7: iteration 36510/ 44073 | consumed samples: 18693120 | consumed tokens: 38283509760 | elapsed time per iteration (s): 4.16 | learning rate: 3.302E-05 | global batch size: 512 | lm loss: 1.945464E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.085 | TFLOPs: 57.36 | 7: iteration 36520/ 44073 | consumed samples: 18698240 | consumed 
tokens: 38293995520 | elapsed time per iteration (s): 4.17 | learning rate: 3.299E-05 | global batch size: 512 | lm loss: 1.935277E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.873 | TFLOPs: 57.27 | 7: iteration 36530/ 44073 | consumed samples: 18703360 | consumed tokens: 38304481280 | elapsed time per iteration (s): 4.15 | learning rate: 3.295E-05 | global batch size: 512 | lm loss: 1.942678E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.246 | TFLOPs: 57.44 | 7: iteration 36540/ 44073 | consumed samples: 18708480 | consumed tokens: 38314967040 | elapsed time per iteration (s): 4.16 | learning rate: 3.292E-05 | global batch size: 512 | lm loss: 1.926479E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.935 | TFLOPs: 57.29 | 7: iteration 36550/ 44073 | consumed samples: 18713600 | consumed tokens: 38325452800 | elapsed time per iteration (s): 4.17 | learning rate: 3.289E-05 | global batch size: 512 | lm loss: 1.943885E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.680 | TFLOPs: 57.17 | 7: iteration 36560/ 44073 | consumed samples: 18718720 | consumed tokens: 38335938560 | elapsed time per iteration (s): 4.15 | learning rate: 3.285E-05 | global batch size: 512 | lm loss: 1.938087E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.422 | TFLOPs: 57.52 | 7: iteration 36570/ 44073 | consumed samples: 18723840 | consumed tokens: 38346424320 | elapsed time per iteration (s): 4.14 | learning rate: 3.282E-05 | global batch size: 512 | lm loss: 1.936833E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 123.524 | TFLOPs: 57.57 | 7: iteration 36580/ 44073 | consumed samples: 18728960 | consumed tokens: 38356910080 | elapsed time per iteration (s): 4.18 | learning rate: 3.279E-05 | global batch size: 512 | lm loss: 1.920846E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.612 | TFLOPs: 57.14 | 7: iteration 36590/ 44073 | consumed samples: 18734080 | consumed tokens: 38367395840 | elapsed time per iteration (s): 4.16 | learning rate: 3.275E-05 | global batch size: 512 | lm loss: 1.929695E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.115 | TFLOPs: 57.38 | 7: iteration 36600/ 44073 | consumed samples: 18739200 | consumed tokens: 38377881600 | elapsed time per iteration (s): 4.32 | learning rate: 3.272E-05 | global batch size: 512 | lm loss: 1.933834E+00 | grad norm: 0.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.423 | TFLOPs: 55.19 | 7: iteration 36610/ 44073 | consumed samples: 18744320 | consumed tokens: 38388367360 | elapsed time per iteration (s): 4.15 | learning rate: 3.269E-05 | global batch size: 512 | lm loss: 1.932324E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.509 | TFLOPs: 57.56 | 7: iteration 36620/ 44073 | consumed samples: 18749440 | consumed tokens: 38398853120 | elapsed time per iteration (s): 4.17 | learning rate: 3.265E-05 | global batch size: 512 | lm loss: 1.932692E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.636 | TFLOPs: 57.15 | 7: iteration 36630/ 44073 | consumed samples: 18754560 | consumed tokens: 38409338880 | elapsed time per iteration (s): 4.16 | learning rate: 3.262E-05 | global batch size: 512 | lm loss: 1.953343E+00 | grad norm: 0.125 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.120 | TFLOPs: 57.38 | 7: iteration 36640/ 44073 | consumed samples: 18759680 | consumed tokens: 38419824640 | elapsed time per iteration (s): 4.21 | learning rate: 3.259E-05 | global batch size: 512 | lm loss: 1.926232E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.508 | TFLOPs: 56.63 | 7: iteration 36650/ 44073 | consumed samples: 18764800 | consumed tokens: 38430310400 | elapsed time per iteration (s): 4.23 | learning rate: 3.255E-05 | global batch size: 512 | lm loss: 1.935964E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.157 | TFLOPs: 56.47 | 7: iteration 36660/ 44073 | consumed samples: 18769920 | consumed tokens: 38440796160 | elapsed time per iteration (s): 4.17 | learning rate: 3.252E-05 | global batch size: 512 | lm loss: 1.920981E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.902 | TFLOPs: 57.28 | 7: iteration 36670/ 44073 | consumed samples: 18775040 | consumed tokens: 38451281920 | elapsed time per iteration (s): 4.36 | learning rate: 3.249E-05 | global batch size: 512 | lm loss: 1.935461E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.528 | TFLOPs: 54.77 | 7: iteration 36680/ 44073 | consumed samples: 18780160 | consumed tokens: 38461767680 | elapsed time per iteration (s): 4.21 | learning rate: 3.245E-05 | global batch size: 512 | lm loss: 1.929991E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.682 | TFLOPs: 56.71 | 7: iteration 36690/ 44073 | consumed samples: 18785280 | consumed tokens: 38472253440 | elapsed time per iteration (s): 4.17 
| learning rate: 3.242E-05 | global batch size: 512 | lm loss: 1.926131E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.662 | TFLOPs: 57.17 | 7: iteration 36700/ 44073 | consumed samples: 18790400 | consumed tokens: 38482739200 | elapsed time per iteration (s): 4.16 | learning rate: 3.239E-05 | global batch size: 512 | lm loss: 1.925812E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.931 | TFLOPs: 57.29 | 7: iteration 36710/ 44073 | consumed samples: 18795520 | consumed tokens: 38493224960 | elapsed time per iteration (s): 4.23 | learning rate: 3.236E-05 | global batch size: 512 | lm loss: 1.932414E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.079 | TFLOPs: 56.43 | 7: iteration 36720/ 44073 | consumed samples: 18800640 | consumed tokens: 38503710720 | elapsed time per iteration (s): 4.16 | learning rate: 3.232E-05 | global batch size: 512 | lm loss: 1.938746E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.000 | TFLOPs: 57.32 | 7: iteration 36730/ 44073 | consumed samples: 18805760 | consumed tokens: 38514196480 | elapsed time per iteration (s): 5.41 | learning rate: 3.229E-05 | global batch size: 512 | lm loss: 1.908170E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 94.598 | TFLOPs: 44.09 | 7: iteration 36740/ 44073 | consumed samples: 18810880 | consumed tokens: 38524682240 | elapsed time per iteration (s): 4.14 | learning rate: 3.226E-05 | global batch size: 512 | lm loss: 1.922536E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.783 | TFLOPs: 57.69 | 7: iteration 36750/ 44073 | 
consumed samples: 18816000 | consumed tokens: 38535168000 | elapsed time per iteration (s): 4.13 | learning rate: 3.223E-05 | global batch size: 512 | lm loss: 1.934880E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.935 | TFLOPs: 57.76 | 7: iteration 36760/ 44073 | consumed samples: 18821120 | consumed tokens: 38545653760 | elapsed time per iteration (s): 4.15 | learning rate: 3.219E-05 | global batch size: 512 | lm loss: 1.947809E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.255 | TFLOPs: 57.44 | 7: iteration 36770/ 44073 | consumed samples: 18826240 | consumed tokens: 38556139520 | elapsed time per iteration (s): 4.15 | learning rate: 3.216E-05 | global batch size: 512 | lm loss: 1.938795E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.507 | TFLOPs: 57.56 | 7: iteration 36780/ 44073 | consumed samples: 18831360 | consumed tokens: 38566625280 | elapsed time per iteration (s): 4.14 | learning rate: 3.213E-05 | global batch size: 512 | lm loss: 1.960092E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.703 | TFLOPs: 57.65 | 7: iteration 36790/ 44073 | consumed samples: 18836480 | consumed tokens: 38577111040 | elapsed time per iteration (s): 4.15 | learning rate: 3.209E-05 | global batch size: 512 | lm loss: 1.907927E+00 | grad norm: 0.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.323 | TFLOPs: 57.47 | 7: iteration 36800/ 44073 | consumed samples: 18841600 | consumed tokens: 38587596800 | elapsed time per iteration (s): 4.15 | learning rate: 3.206E-05 | global batch size: 512 | lm loss: 1.946252E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 123.364 | TFLOPs: 57.49 | 7: iteration 36810/ 44073 | consumed samples: 18846720 | consumed tokens: 38598082560 | elapsed time per iteration (s): 4.15 | learning rate: 3.203E-05 | global batch size: 512 | lm loss: 1.925102E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.287 | TFLOPs: 57.46 | 7: iteration 36820/ 44073 | consumed samples: 18851840 | consumed tokens: 38608568320 | elapsed time per iteration (s): 4.16 | learning rate: 3.200E-05 | global batch size: 512 | lm loss: 1.939671E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.123 | TFLOPs: 57.38 | 7: iteration 36830/ 44073 | consumed samples: 18856960 | consumed tokens: 38619054080 | elapsed time per iteration (s): 4.19 | learning rate: 3.197E-05 | global batch size: 512 | lm loss: 1.945512E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.173 | TFLOPs: 56.94 | 7: iteration 36840/ 44073 | consumed samples: 18862080 | consumed tokens: 38629539840 | elapsed time per iteration (s): 4.18 | learning rate: 3.193E-05 | global batch size: 512 | lm loss: 1.932243E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.538 | TFLOPs: 57.11 | 7: iteration 36850/ 44073 | consumed samples: 18867200 | consumed tokens: 38640025600 | elapsed time per iteration (s): 4.21 | learning rate: 3.190E-05 | global batch size: 512 | lm loss: 1.932353E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.595 | TFLOPs: 56.67 | 7: iteration 36860/ 44073 | consumed samples: 18872320 | consumed tokens: 38650511360 | elapsed time per iteration (s): 4.16 | learning rate: 3.187E-05 | global batch size: 512 | lm loss: 
1.941673E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.087 | TFLOPs: 57.36 | 7: iteration 36870/ 44073 | consumed samples: 18877440 | consumed tokens: 38660997120 | elapsed time per iteration (s): 4.23 | learning rate: 3.184E-05 | global batch size: 512 | lm loss: 1.946413E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.126 | TFLOPs: 56.45 | 7: iteration 36880/ 44073 | consumed samples: 18882560 | consumed tokens: 38671482880 | elapsed time per iteration (s): 4.21 | learning rate: 3.180E-05 | global batch size: 512 | lm loss: 1.928719E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.502 | TFLOPs: 56.63 | 7: iteration 36890/ 44073 | consumed samples: 18887680 | consumed tokens: 38681968640 | elapsed time per iteration (s): 4.14 | learning rate: 3.177E-05 | global batch size: 512 | lm loss: 1.926086E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.740 | TFLOPs: 57.67 | 7: iteration 36900/ 44073 | consumed samples: 18892800 | consumed tokens: 38692454400 | elapsed time per iteration (s): 4.13 | learning rate: 3.174E-05 | global batch size: 512 | lm loss: 1.928362E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.914 | TFLOPs: 57.75 | 7: iteration 36910/ 44073 | consumed samples: 18897920 | consumed tokens: 38702940160 | elapsed time per iteration (s): 4.19 | learning rate: 3.171E-05 | global batch size: 512 | lm loss: 1.923606E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.244 | TFLOPs: 56.97 | 7: iteration 36920/ 44073 | consumed samples: 18903040 | consumed tokens: 38713425920 | 
elapsed time per iteration (s): 4.15 | learning rate: 3.168E-05 | global batch size: 512 | lm loss: 1.950475E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.438 | TFLOPs: 57.53 | 7: iteration 36930/ 44073 | consumed samples: 18908160 | consumed tokens: 38723911680 | elapsed time per iteration (s): 4.17 | learning rate: 3.164E-05 | global batch size: 512 | lm loss: 1.937297E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.891 | TFLOPs: 57.27 | 7: iteration 36940/ 44073 | consumed samples: 18913280 | consumed tokens: 38734397440 | elapsed time per iteration (s): 4.16 | learning rate: 3.161E-05 | global batch size: 512 | lm loss: 1.940339E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.931 | TFLOPs: 57.29 | 7: iteration 36950/ 44073 | consumed samples: 18918400 | consumed tokens: 38744883200 | elapsed time per iteration (s): 4.20 | learning rate: 3.158E-05 | global batch size: 512 | lm loss: 1.931717E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 121.882 | TFLOPs: 56.80 | 7: iteration 36960/ 44073 | consumed samples: 18923520 | consumed tokens: 38755368960 | elapsed time per iteration (s): 4.16 | learning rate: 3.155E-05 | global batch size: 512 | lm loss: 1.938953E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.078 | TFLOPs: 57.36 | 7: iteration 36970/ 44073 | consumed samples: 18928640 | consumed tokens: 38765854720 | elapsed time per iteration (s): 4.16 | learning rate: 3.152E-05 | global batch size: 512 | lm loss: 1.917009E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.046 | TFLOPs: 
57.35 | 7: iteration 36980/ 44073 | consumed samples: 18933760 | consumed tokens: 38776340480 | elapsed time per iteration (s): 4.15 | learning rate: 3.149E-05 | global batch size: 512 | lm loss: 1.926055E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.457 | TFLOPs: 57.54 | 7: iteration 36990/ 44073 | consumed samples: 18938880 | consumed tokens: 38786826240 | elapsed time per iteration (s): 4.17 | learning rate: 3.145E-05 | global batch size: 512 | lm loss: 1.950212E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.905 | TFLOPs: 57.28 | 7: iteration 37000/ 44073 | consumed samples: 18944000 | consumed tokens: 38797312000 | elapsed time per iteration (s): 4.15 | learning rate: 3.142E-05 | global batch size: 512 | lm loss: 1.930473E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.516 | TFLOPs: 57.56 | 7: ------------------------------------------------------------------------------------------- 7: valid loss at iteration 37000 | lm loss value: 1.838464E+00 | lm loss PPL: 6.286875E+00 | 7: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 37000 to checkpoints_2b2 0: [2022-11-27 05:43:27,693] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step37000 is begin to save! 0: [2022-11-27 05:43:27,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:43:28,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:43:28,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_03-model_00-model_states.pt... 
0: [2022-11-27 05:43:28,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:43:28,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:43:28,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:43:28,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:43:28,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:43:28,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:43:28,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:43:28,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_07-model_00-model_states.pt... 0: [2022-11-27 05:43:28,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:43:28,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:43:28,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:43:28,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_09-model_00-model_states.pt... 
0: [2022-11-27 05:43:29,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:43:29,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:43:29,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:43:29,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:43:29,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:43:29,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:43:29,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:43:29,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_13-model_00-model_states.pt... 0: [2022-11-27 05:43:29,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:43:29,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:43:29,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:43:29,834] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_15-model_00-model_states.pt... 
0: [2022-11-27 05:43:29,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:43:29,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:43:30,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:43:30,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:43:30,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:43:30,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:43:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:43:30,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_19-model_00-model_states.pt... 0: [2022-11-27 05:43:30,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:43:30,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:43:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:43:30,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_21-model_00-model_states.pt... 
0: [2022-11-27 05:43:30,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:43:30,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:43:30,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:43:30,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:43:31,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:43:31,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:43:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:43:31,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_25-model_00-model_states.pt... 0: [2022-11-27 05:43:31,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:43:31,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:43:31,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:43:31,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_27-model_00-model_states.pt... 
0: [2022-11-27 05:43:31,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:43:31,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:43:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:43:31,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_29-model_00-model_states.pt... 0: [2022-11-27 05:43:31,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_29-model_00-model_states.pt. 0: [2022-11-27 05:43:31,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:43:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:43:32,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_31-model_00-model_states.pt... 0: [2022-11-27 05:43:32,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_31-model_00-model_states.pt. 0: [2022-11-27 05:43:32,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_32-model_00-model_states.pt... 0: [2022-11-27 05:43:32,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_32-model_00-model_states.pt. 0: [2022-11-27 05:43:32,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_33-model_00-model_states.pt... 
0: [2022-11-27 05:43:32,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_33-model_00-model_states.pt. 0: [2022-11-27 05:43:32,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_34-model_00-model_states.pt... 0: [2022-11-27 05:43:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_34-model_00-model_states.pt. 0: [2022-11-27 05:43:32,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/layer_36-model_00-model_states.pt... 0: [2022-11-27 05:43:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/layer_36-model_00-model_states.pt. 0: [2022-11-27 05:43:32,631] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_2b2/global_step37000/mp_rank_00_model_states.pt 0: [2022-11-27 05:43:32,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:43:32,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/mp_rank_00_model_states.pt. 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:32,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_2b2/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:43:33,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:43:33,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:43:33,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:43:33,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:43:33,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:43:33,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:43:33,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:43:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
3: [2022-11-27 05:43:33,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-27 05:43:33,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
4: [2022-11-27 05:43:33,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-27 05:43:33,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:43:33,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:43:33,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-27 05:43:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-27 05:43:33,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
3: [2022-11-27 05:43:33,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-27 05:43:33,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:43:33,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:43:33,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-27 05:43:33,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-27 05:43:33,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:43:33,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:43:33,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:43:33,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-27 05:43:33,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:43:33,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:43:33,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt.
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt.
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt.
1: [2022-11-27 05:43:33,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt.
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-27 05:43:33,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_2b2/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt
1: [2022-11-27 05:43:33,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
0: successfully saved checkpoint at iteration 37000 to checkpoints_2b2
7: time (ms) | save-checkpoint: 6229.73
7: iteration 37010/ 44073 | consumed samples: 18949120 | consumed tokens: 38807797760 | elapsed time per iteration (s): 4.95 | learning rate: 3.139E-05 | global batch size: 512 | lm loss: 1.919727E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.400 | TFLOPs: 48.19 |