Setting ds_accelerator to cuda (auto detect)
[2023-06-12 07:26:44,544] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2023-06-12 07:26:44,602] [INFO] [runner.py:555:main] cmd = /home/zhaiyuanzhao/anaconda3/envs/RLHF/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path /home/zhaiyuanzhao/llm/dataset/rm-static/data /home/zhaiyuanzhao/llm/dataset/full-hh-rlhf/data /home/zhaiyuanzhao/llm/dataset/synthetic-instruct-gptj-pairwise/data /home/zhaiyuanzhao/llm/dataset/rlhf-reward-datasets/data --data_split 2,4,4 --model_name_or_path /home/zhaiyuanzhao/llm/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir ./output
Setting ds_accelerator to cuda (auto detect)
[2023-06-12 07:26:46,974] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-12 07:26:46,974] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-12 07:26:46,974] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-12 07:26:46,974] [INFO] [launch.py:163:main] dist_world_size=8
[2023-06-12 07:26:46,974] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Setting ds_accelerator to cuda (auto detect)  [x8, one per rank]
[2023-06-12 07:26:51,092] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented  [x8, one per rank]
[2023-06-12 07:26:51,093] [INFO] [comm.py:594:init_distributed] cdb=None  [x8, one per rank]
[2023-06-12 07:26:51,093] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
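(Aside: the --world_info blob in the runner command above is just the WORLD INFO DICT that launch.py prints, base64-encoded. A quick check in plain Python:)

    import base64, json

    blob = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
    print(json.loads(base64.b64decode(blob)))
    # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}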
model loaded  [x8, one per rank]
Found cached dataset parquet (/home/zhaiyuanzhao/.cache/huggingface/datasets/parquet/default-d09980a08a1dbd7c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  0%|          | 0/2 [00:00<?, ?it/s]
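(The "Found cached dataset parquet" line is Hugging Face datasets reusing a previously built Arrow cache for the local parquet folders passed via --data_path, and the 0/2 bar is its two splits. A minimal sketch of the kind of call that emits this message, using the rm-static path from the command above; the actual data helpers in main.py may wrap this differently:)

    from datasets import load_dataset

    # A repeat call hits the fingerprinted cache under
    # ~/.cache/huggingface/datasets/parquet/... instead of rebuilding,
    # which is exactly what prints "Found cached dataset parquet (...)".
    ds = load_dataset("/home/zhaiyuanzhao/llm/dataset/rm-static/data")
    print(ds)  # DatasetDict with the two cached splits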
[2023-06-12 07:32:34,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
Using /home/zhaiyuanzhao/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...  [x8, one per rank]
[2023-06-12 07:32:34,410] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] activation_checkpointing_config  { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] amp_enabled .................. False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] amp_params ................... False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] bfloat16_enabled ............. False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] comms_config .................
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] communication_data_type ...... None
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False
[2023-06-12 07:32:34,411] [INFO] [config.py:964:print] curriculum_params_legacy ..... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] data_efficiency_enabled ...... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] dataloader_drop_last ......... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] disable_allgather ............ False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] dump_state ................... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_enabled ........... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] eigenvalue_verbose ........... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] elasticity_enabled ........... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] fp16_auto_cast ............... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] fp16_enabled ................. True
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] global_rank .................. 0
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] grad_accum_dtype ............. None
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] gradient_clipping ............ 1.0
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] initial_dynamic_scale ........ 65536
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] load_universal_checkpoint .... False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] loss_scale ................... 0
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] memory_breakdown ............. False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] mics_hierarchial_params_gather False
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] mics_shard_size .............. -1
[2023-06-12 07:32:34,412] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] optimizer_name ............... None
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] optimizer_params ............. None
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] pld_enabled .................. False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] pld_params ................... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] prescale_gradients ........... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] scheduler_name ............... None
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] scheduler_params ............. None
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] sparse_attention ............. None
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] steps_per_print .............. 10
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] train_batch_size ............. 32
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 4
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] use_node_local_storage ....... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] wall_clock_breakdown ......... False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] world_size ................... 8
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] zero_allow_untested_optimizer False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] zero_enabled ................. False
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True
[2023-06-12 07:32:34,413] [INFO] [config.py:964:print] zero_optimization_stage ...... 0
[2023-06-12 07:32:34,413] [INFO] [config.py:950:print_user_config] json = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 0,
        "offload_param": { "device": "none" },
        "offload_optimizer": { "device": "none" },
        "stage3_param_persistence_threshold": 1.000000e+04,
        "stage3_max_live_parameters": 3.000000e+07,
        "stage3_prefetch_bucket_size": 3.000000e+07,
        "memory_efficient_linear": false
    },
    "fp16": { "enabled": true, "loss_scale_window": 100 },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false,
    "hybrid_engine": {
        "enabled": false,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": false,
        "pin_parameters": true,
        "tp_gather_partition_size": 8
    }
}
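(Everything in the dump above other than the final json = {...} block is a default; that JSON is the user config the engine was actually built from. A minimal sketch of how such a dict is handed over, with a stand-in model and optimizer rather than the script's OPT-350m reward model and its fused Adam:)

    import torch
    import deepspeed

    ds_config = {
        "train_batch_size": 32,               # 4 per GPU x 8 GPUs x 1 accumulation step
        "train_micro_batch_size_per_gpu": 4,
        "steps_per_print": 10,
        "zero_optimization": {"stage": 0},
        "fp16": {"enabled": True, "loss_scale_window": 100},
        "gradient_clipping": 1.0,
    }

    model = torch.nn.Linear(512, 1)           # stand-in for the reward model
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

    # Returns the wrapped engine; fp16 and dynamic loss scaling come from ds_config.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config
    )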
Emitting ninja build file /home/zhaiyuanzhao/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...  [x8, one per rank]
Time to load utils op: 1.7743585109710693 seconds  [the other 7 ranks report ~1.82-1.83 s]
***** Running training *****
***** Evaluating reward, Epoch 0/1 *****
chosen_last_scores (higher is better) : 2.576474905014038, acc (higher is better) : 0.4899999797344208
Beginning of Epoch 1/1, Total Micro Batches 3680
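(For context on the numbers above: in the step-2 reward-model evaluation, acc is, as far as I can tell, pairwise accuracy, i.e. the fraction of prompts whose chosen answer scores higher than the rejected one, so 0.49 before any training is chance level. A toy illustration with made-up scores:)

    import torch

    chosen_scores = torch.tensor([0.3, -1.2, 0.8, 0.1])    # hypothetical values
    rejected_scores = torch.tensor([0.5, -1.5, 0.2, 0.4])

    # Pairwise accuracy: how often the chosen response wins.
    acc = (chosen_scores > rejected_scores).float().mean()
    print(acc.item())  # 0.5 on this toy batch; ~0.49 in the log above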
[2023-06-12 07:32:42,955] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 0  [x8, one per rank]
[2023-06-12 07:32:42,956] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0  [x8, one per rank]
[2023-06-12 07:32:42,956] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[2023-06-12 07:32:43,187] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1  [x8, one per rank]
[2023-06-12 07:32:43,187] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0  [x8, one per rank]
[2023-06-12 07:32:43,187] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:32:43,403] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2  [x8, one per rank]
[2023-06-12 07:32:43,404] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0  [x8, one per rank]
[2023-06-12 07:32:43,404] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:32:43,620] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3  [x8, one per rank]
[2023-06-12 07:32:43,620] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0  [x8, one per rank]
[2023-06-12 07:32:43,620] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-06-12 07:32:43,835] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4  [x8, one per rank]
[2023-06-12 07:32:43,836] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0  [x8, one per rank]
[2023-06-12 07:32:43,836] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-06-12 07:32:44,052] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5  [x8, one per rank]
[2023-06-12 07:32:44,053] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0  [x8, one per rank]
[2023-06-12 07:32:44,053] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
[2023-06-12 07:32:44,269] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6  [x8, one per rank]
[2023-06-12 07:32:44,269] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0  [x8, one per rank]
[2023-06-12 07:32:44,269] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
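(Iterations 0-6 above are fp16 dynamic loss scaling settling in: every step that overflows is skipped and the scale halves, from 65536 down to 512.0, and after loss_scale_window = 100 consecutive clean iterations it doubles again, which is the "512.0 to 1024.0" message between steps 100 and 110 below. A minimal re-implementation of the policy for readability, not DeepSpeed's actual code:)

    def update_scale(scale, overflow, clean_steps, window=100, min_scale=1):
        """Halve on overflow, double after `window` clean steps."""
        if overflow:
            return max(scale / 2, min_scale), 0   # step is skipped, counter resets
        clean_steps += 1
        if clean_steps >= window:
            return scale * 2, 0                   # e.g. 512.0 -> 1024.0
        return scale, clean_steps

    scale, clean = 65536, 0
    for overflow in [True] * 7 + [False] * 100:
        scale, clean = update_scale(scale, overflow, clean)
    print(scale)  # 1024.0: seven halvings to 512.0, then one doubling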
[2023-06-12 07:32:45,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=7, lr=[4.999991801084829e-05, 4.999991801084829e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:45,027] [INFO] [timer.py:215:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=139.45916111120349, CurrSamplesPerSec=128.15631859921302, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:47,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=7, lr=[4.999846044088921e-05, 4.999846044088921e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:47,524] [INFO] [timer.py:215:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=133.08776115147245, CurrSamplesPerSec=128.86802949171062, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:50,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=7, lr=[4.9995181012051625e-05, 4.9995181012051625e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:50,015] [INFO] [timer.py:215:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=131.49171226334505, CurrSamplesPerSec=128.62005460359686, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:52,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=7, lr=[4.9990079963336504e-05, 4.9990079963336504e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:52,505] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=130.75525989930847, CurrSamplesPerSec=128.66518206786145, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:54,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[4.998315766650239e-05, 4.998315766650239e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:54,998] [INFO] [timer.py:215:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=130.30280993651792, CurrSamplesPerSec=128.76022582826883, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:57,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=7, lr=[4.997441462603825e-05, 4.997441462603825e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:57,485] [INFO] [timer.py:215:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=130.0584389411792, CurrSamplesPerSec=128.32271098027707, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:32:59,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=7, lr=[4.996385147912677e-05, 4.996385147912677e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:32:59,978] [INFO] [timer.py:215:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=129.84299111248677, CurrSamplesPerSec=128.00317390682372, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:02,463] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=7, lr=[4.995146899559788e-05, 4.995146899559788e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:02,472] [INFO] [timer.py:215:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=129.66959360325677, CurrSamplesPerSec=128.52410993009673, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:04,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=7, lr=[4.993726807787265e-05, 4.993726807787265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:04,969] [INFO] [timer.py:215:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=129.5227164533939, CurrSamplesPerSec=127.3706632553805, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:07,458] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=7, lr=[4.9921249760897536e-05, 4.9921249760897536e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:07,467] [INFO] [timer.py:215:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=129.40289835578912, CurrSamplesPerSec=128.923113571832, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:09,442] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations  [x8, one per rank]
[2023-06-12 07:33:09,443] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0  [x8, one per rank]
[2023-06-12 07:33:09,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=7, lr=[4.990341521206896e-05, 4.990341521206896e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:09,963] [INFO] [timer.py:215:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=129.3134761036645, CurrSamplesPerSec=129.07052455141417, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:12,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=7, lr=[4.9883765731148184e-05, 4.9883765731148184e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:12,459] [INFO] [timer.py:215:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=129.23928190718652, CurrSamplesPerSec=128.01355123919254, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:14,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=7, lr=[4.986230275016667e-05, 4.986230275016667e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:14,961] [INFO] [timer.py:215:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=129.153005016377, CurrSamplesPerSec=126.98083618419061, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:17,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=7, lr=[4.983902783332164e-05, 4.983902783332164e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:17,465] [INFO] [timer.py:215:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=129.06913315521484, CurrSamplesPerSec=127.52023760095999, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:19,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=7, lr=[4.98139426768621e-05, 4.98139426768621e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:19,971] [INFO] [timer.py:215:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=128.99316406920303, CurrSamplesPerSec=128.62523155712765, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:20,939] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 153  [x8, one per rank]
[2023-06-12 07:33:20,940] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0  [x8, one per rank]
[2023-06-12 07:33:20,940] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
[2023-06-12 07:33:22,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=8, lr=[4.978981978944271e-05, 4.978981978944271e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:22,440] [INFO] [timer.py:215:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=129.0466569387555, CurrSamplesPerSec=128.65728864217388, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:24,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=8, lr=[4.9761300323275173e-05, 4.9761300323275173e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:24,944] [INFO] [timer.py:215:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=128.98792191920867, CurrSamplesPerSec=127.31713906279643, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:27,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=8, lr=[4.973097628218415e-05, 4.973097628218415e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:27,446] [INFO] [timer.py:215:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=128.93962083991278, CurrSamplesPerSec=127.7033259372871, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:29,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=8, lr=[4.9698849876150674e-05, 4.9698849876150674e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:29,950] [INFO] [timer.py:215:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=128.89003113944034, CurrSamplesPerSec=127.6620275607028, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:32,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=8, lr=[4.966492344651005e-05, 4.966492344651005e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:32,454] [INFO] [timer.py:215:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=128.84730591363484, CurrSamplesPerSec=128.1066561802343, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:34,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=8, lr=[4.962919946578123e-05, 4.962919946578123e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:34,957] [INFO] [timer.py:215:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=128.8093961189123, CurrSamplesPerSec=128.07291749166728, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:37,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=8, lr=[4.95916805374866e-05, 4.95916805374866e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:37,455] [INFO] [timer.py:215:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=128.78839050454786, CurrSamplesPerSec=128.18385746156903, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:39,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=8, lr=[4.955236939596225e-05, 4.955236939596225e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:39,952] [INFO] [timer.py:215:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=128.77077558798177, CurrSamplesPerSec=128.37364613342208, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:42,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=8, lr=[4.95112689061587e-05, 4.95112689061587e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:42,449] [INFO] [timer.py:215:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=128.75464099241177, CurrSamplesPerSec=128.27463584659156, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:44,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=8, lr=[4.946838206343211e-05, 4.946838206343211e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:44,947] [INFO] [timer.py:215:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=128.73673254242163, CurrSamplesPerSec=128.1510569633548, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:46,175] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations  [x8, one per rank]
[2023-06-12 07:33:46,176] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0  [x8, one per rank]
[2023-06-12 07:33:47,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=8, lr=[4.9423711993325955e-05, 4.9423711993325955e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:47,448] [INFO] [timer.py:215:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=128.7151608599872, CurrSamplesPerSec=127.82080785032275, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:49,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=8, lr=[4.9377261951343265e-05, 4.9377261951343265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:49,947] [INFO] [timer.py:215:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=128.6987933502108, CurrSamplesPerSec=128.839453838947, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:52,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=8, lr=[4.9329035322709386e-05, 4.9329035322709386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:52,447] [INFO] [timer.py:215:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=128.6829715578517, CurrSamplesPerSec=127.99377471684403, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:54,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=8, lr=[4.927903562212521e-05, 4.927903562212521e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:54,953] [INFO] [timer.py:215:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=128.65557921272662, CurrSamplesPerSec=127.4600510152723, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:57,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=8, lr=[4.922726649351108e-05, 4.922726649351108e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:57,459] [INFO] [timer.py:215:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=128.6298066168354, CurrSamplesPerSec=127.62633814040086, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:33:59,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=8, lr=[4.917373170974119e-05, 4.917373170974119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:33:59,963] [INFO] [timer.py:215:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=128.6110159438476, CurrSamplesPerSec=128.21336777378585, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:02,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=8, lr=[4.9118435172368673e-05, 4.9118435172368673e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:02,468] [INFO] [timer.py:215:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=128.59017355702366, CurrSamplesPerSec=128.34504382949385, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:04,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=8, lr=[4.906138091134118e-05, 4.906138091134118e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:04,974] [INFO] [timer.py:215:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=128.57031005727694, CurrSamplesPerSec=127.65025027366995, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:07,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=8, lr=[4.900257308470728e-05, 4.900257308470728e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:07,481] [INFO] [timer.py:215:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=128.54944919086356, CurrSamplesPerSec=127.49334642928046, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:09,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=8, lr=[4.894201597831334e-05, 4.894201597831334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:09,983] [INFO] [timer.py:215:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=128.53784489123402, CurrSamplesPerSec=128.61376886309714, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
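(Throughput sanity check: with the global batch of 32 and the steady ~128.5 samples/s above, one step costs about 0.25 s, which matches the ~2.5 s per 10 steps in the timestamps; assuming "Total Micro Batches 3680" counts optimizer steps, which fits micro_step == global_step with gradient_accumulation_steps=1, the epoch should take roughly 15 minutes:)

    samples_per_sec = 128.5   # RunningAvgSamplesPerSec from the log
    global_batch = 32         # train_batch_size: 4 per GPU x 8 GPUs
    total_steps = 3680        # "Total Micro Batches 3680"

    step_time = global_batch / samples_per_sec      # ~0.249 s per step
    print(step_time * total_steps / 60)             # ~15.3 minutes per epoch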
CurrSamplesPerSec=128.61376886309714, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:11,210] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,211] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,211] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,211] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:11,212] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0 [2023-06-12 07:34:12,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=8, lr=[4.88797140054912e-05, 4.88797140054912e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:12,483] [INFO] [timer.py:215:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=128.5282209420601, CurrSamplesPerSec=128.1801849294529, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:14,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=8, lr=[4.88156717067366e-05, 4.88156717067366e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:14,986] [INFO] [timer.py:215:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=128.5168424032244, CurrSamplesPerSec=127.65110010823231, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:17,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=8, lr=[4.874989374937817e-05, 4.874989374937817e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:17,490] [INFO] [timer.py:215:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=128.502842177046, CurrSamplesPerSec=128.73219184126023, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:19,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=8, lr=[4.8682384927237355e-05, 4.8682384927237355e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:19,995] [INFO] [timer.py:215:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=128.4893315147837, CurrSamplesPerSec=128.33682149195275, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:22,493] [INFO] [logging.py:96:log_dist] [Rank 
0] step=400, skipped=8, lr=[4.861315016027902e-05, 4.861315016027902e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:22,503] [INFO] [timer.py:215:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=128.4729178346795, CurrSamplesPerSec=127.45932476372808, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:25,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=8, lr=[4.854219449425288e-05, 4.854219449425288e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:25,010] [INFO] [timer.py:215:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=128.45682658711848, CurrSamplesPerSec=128.3436938211917, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:27,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=8, lr=[4.84695231003258e-05, 4.84695231003258e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:27,518] [INFO] [timer.py:215:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=128.44162868085314, CurrSamplesPerSec=128.25183728501304, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:30,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=8, lr=[4.83951412747049e-05, 4.83951412747049e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:30,020] [INFO] [timer.py:215:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=128.43448259167772, CurrSamplesPerSec=128.02905933010155, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:32,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=8, lr=[4.831905443825159e-05, 4.831905443825159e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:32,527] [INFO] [timer.py:215:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=128.4210509777968, CurrSamplesPerSec=127.48486958272741, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:35,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=8, lr=[4.824126813608649e-05, 4.824126813608649e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:34:35,032] [INFO] [timer.py:215:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=128.41147064387533, CurrSamplesPerSec=128.35289898660506, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:34:36,265] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-06-12 07:34:36,265] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-06-12 07:34:36,265] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-06-12 07:34:36,265] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:34:36,266] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-06-12 
[2023-06-12 07:34:37,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=8, lr=[4.8161788037185327e-05, 4.8161788037185327e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:37,538] [INFO] [timer.py:215:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=128.39982672043462, CurrSamplesPerSec=128.5660911970681, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:40,035] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=8, lr=[4.808061993396574e-05, 4.808061993396574e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:40,044] [INFO] [timer.py:215:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=128.3898396634247, CurrSamplesPerSec=128.38248719214272, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:42,538] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=8, lr=[4.7997769741865226e-05, 4.7997769741865226e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:42,548] [INFO] [timer.py:215:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=128.38231635005965, CurrSamplesPerSec=128.10751210274736, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:45,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=8, lr=[4.791324349890993e-05, 4.791324349890993e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:45,052] [INFO] [timer.py:215:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=128.3740816659432, CurrSamplesPerSec=128.71256296924253, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:47,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=8, lr=[4.782704736527466e-05, 4.782704736527466e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:47,554] [INFO] [timer.py:215:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=128.3689129590794, CurrSamplesPerSec=128.39808480616077, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:50,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=8, lr=[4.7739187622833914e-05, 4.7739187622833914e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:50,062] [INFO] [timer.py:215:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=128.358240762381, CurrSamplesPerSec=127.65571369330914, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:52,559] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=8, lr=[4.76496706747041e-05, 4.76496706747041e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:52,568] [INFO] [timer.py:215:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=128.34929447986508, CurrSamplesPerSec=126.91059658067451, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:54,540] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 527
[2023-06-12 07:34:54,541] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:34:54,541] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-06-12 07:34:55,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=9, lr=[4.756769390003164e-05, 4.756769390003164e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:55,041] [INFO] [timer.py:215:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=128.3737817075941, CurrSamplesPerSec=128.51980255377825, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:34:57,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=9, lr=[4.7475046333851735e-05, 4.7475046333851735e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:34:57,546] [INFO] [timer.py:215:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=128.3665334982638, CurrSamplesPerSec=127.20672912431347, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:00,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=9, lr=[4.738076081229433e-05, 4.738076081229433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:00,047] [INFO] [timer.py:215:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=128.36303470660093, CurrSamplesPerSec=128.1854489575086, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:02,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=9, lr=[4.728484420677918e-05, 4.728484420677918e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:02,555] [INFO] [timer.py:215:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=128.352690453546, CurrSamplesPerSec=127.53926208042084, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:05,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=9, lr=[4.718730350759753e-05, 4.718730350759753e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:05,061] [INFO] [timer.py:215:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=128.34481997852183, CurrSamplesPerSec=128.06986232876974, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
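[editor's note] The _update_scale messages follow DeepSpeed's dynamic (FP16) loss-scaling policy as it appears in this log: after 100 consecutive overflow-free iterations the scale is doubled, and on a gradient overflow the optimizer step is skipped and the scale is halved. This is also why the skipped= counter in the log_dist lines increments at each overflow (8 -> 9 at iteration 527 above). A minimal sketch of that policy in Python (an illustrative re-implementation, not the fused_optimizer.py source):

class DynamicLossScaler:
    """Double the scale after `scale_window` clean iterations; halve on overflow."""

    def __init__(self, init_scale=1024.0, scale_window=100, scale_factor=2.0, min_scale=1.0):
        self.cur_scale = init_scale
        self.scale_window = scale_window
        self.scale_factor = scale_factor
        self.min_scale = min_scale
        self.last_overflow_iter = -1

    def update_scale(self, overflow: bool, iteration: int) -> None:
        if overflow:
            # "Grad overflow on iteration N": the step is skipped, scale is halved
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.last_overflow_iter = iteration
        elif (iteration - self.last_overflow_iter) % self.scale_window == 0:
            # "No Grad overflow for 100 iterations": scale is doubled
            self.cur_scale *= self.scale_factor

Starting from 1024.0 this reproduces the trajectory seen here: 1024 -> 2048 -> 4096, back to 2048 at the overflow on iteration 527, then climbing again and later oscillating between 8192 and 16384.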
[2023-06-12 07:35:07,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=9, lr=[4.7088145823402683e-05, 4.7088145823402683e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:07,570] [INFO] [timer.py:215:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=128.33496935018465, CurrSamplesPerSec=126.01858106046309, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:10,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=9, lr=[4.698737838069198e-05, 4.698737838069198e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:10,080] [INFO] [timer.py:215:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=128.3248254812987, CurrSamplesPerSec=127.2321727853909, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:12,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=9, lr=[4.6885008523280066e-05, 4.6885008523280066e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:12,583] [INFO] [timer.py:215:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=128.3202640650528, CurrSamplesPerSec=127.35748947209608, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:15,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=9, lr=[4.678104371176373e-05, 4.678104371176373e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:15,089] [INFO] [timer.py:215:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=128.31383356351878, CurrSamplesPerSec=128.36185994912108, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:17,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=9, lr=[4.667549152297817e-05, 4.667549152297817e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:17,597] [INFO] [timer.py:215:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=128.30616372284862, CurrSamplesPerSec=128.1126478780208, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:19,828] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:35:19,830] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:35:20,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=9, lr=[4.65683596494448e-05, 4.65683596494448e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:20,101] [INFO] [timer.py:215:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=128.30143717198132, CurrSamplesPerSec=127.93533541891858, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:22,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=9, lr=[4.645965589881063e-05, 4.645965589881063e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:22,603] [INFO] [timer.py:215:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=128.2985768333149, CurrSamplesPerSec=127.84126155610461, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:25,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=9, lr=[4.634938819327925e-05, 4.634938819327925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:25,108] [INFO] [timer.py:215:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=128.29372997520036, CurrSamplesPerSec=127.30627057071584, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:27,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=9, lr=[4.6237564569033496e-05, 4.6237564569033496e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:27,617] [INFO] [timer.py:215:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=128.2856388262904, CurrSamplesPerSec=127.33670702574291, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:30,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=9, lr=[4.612419317564973e-05, 4.612419317564973e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:30,119] [INFO] [timer.py:215:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=128.2829462134768, CurrSamplesPerSec=128.31167013534042, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:32,618] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=9, lr=[4.6009282275503976e-05, 4.6009282275503976e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:32,628] [INFO] [timer.py:215:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=128.27544356632112, CurrSamplesPerSec=128.14114668313889, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:35,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=9, lr=[4.589284024316967e-05, 4.589284024316967e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:35,134] [INFO] [timer.py:215:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=128.26987496543404, CurrSamplesPerSec=127.9328965252982, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:37,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=9, lr=[4.5774875564807464e-05, 4.5774875564807464e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:37,641] [INFO] [timer.py:215:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=128.26432308919502, CurrSamplesPerSec=127.89961492244609, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:40,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=9, lr=[4.5655396837546625e-05, 4.5655396837546625e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:40,152] [INFO] [timer.py:215:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=128.25584438199786, CurrSamplesPerSec=127.44613263471945, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:42,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=9, lr=[4.5534412768858605e-05, 4.5534412768858605e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:42,657] [INFO] [timer.py:215:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=128.25162341829233, CurrSamplesPerSec=128.2617646946397, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:44,893] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:35:44,894] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:35:45,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=9, lr=[4.541193217592236e-05, 4.541193217592236e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:45,166] [INFO] [timer.py:215:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=128.24468611851677, CurrSamplesPerSec=127.89181514736585, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:47,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=9, lr=[4.528796398498182e-05, 4.528796398498182e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:47,673] [INFO] [timer.py:215:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=128.23981378876826, CurrSamplesPerSec=128.59516174310616, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:50,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=9, lr=[4.516251723069534e-05, 4.516251723069534e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:50,178] [INFO] [timer.py:215:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=128.23634332075477, CurrSamplesPerSec=127.36002732849707, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:52,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=9, lr=[4.5035601055477245e-05, 4.5035601055477245e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:52,687] [INFO] [timer.py:215:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=128.230046450437, CurrSamplesPerSec=127.5808448690663, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:55,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=9, lr=[4.4907224708831575e-05, 4.4907224708831575e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:55,193] [INFO] [timer.py:215:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=128.22597894609675, CurrSamplesPerSec=127.77760874180431, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:35:57,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=9, lr=[4.477739754667796e-05, 4.477739754667796e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:35:57,703] [INFO] [timer.py:215:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=128.21925090506426, CurrSamplesPerSec=128.1185178397222, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:00,198] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=9, lr=[4.4646129030669795e-05, 4.4646129030669795e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:00,208] [INFO] [timer.py:215:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=128.21628880718973, CurrSamplesPerSec=128.48252504695407, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:02,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=9, lr=[4.451342872750468e-05, 4.451342872750468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:02,718] [INFO] [timer.py:215:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=128.20979376777623, CurrSamplesPerSec=128.1073898274024, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:05,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=9, lr=[4.43793063082272e-05, 4.43793063082272e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:05,227] [INFO] [timer.py:215:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=128.20442298428011, CurrSamplesPerSec=127.25400697623925, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:07,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=9, lr=[4.42437715475241e-05, 4.42437715475241e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:07,736] [INFO] [timer.py:215:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=128.19871914803585, CurrSamplesPerSec=127.99011306842942, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:09,975] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:36:09,976] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:36:10,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=9, lr=[4.410683432301198e-05, 4.410683432301198e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:10,247] [INFO] [timer.py:215:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=128.19199218936544, CurrSamplesPerSec=127.7917212946962, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:12,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=9, lr=[4.3968504614517336e-05, 4.3968504614517336e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:12,757] [INFO] [timer.py:215:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=128.18673223397778, CurrSamplesPerSec=128.4114747573707, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:15,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=9, lr=[4.38287925033493e-05, 4.38287925033493e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:15,268] [INFO] [timer.py:215:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=128.18034314426606, CurrSamplesPerSec=127.34202281986425, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:17,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=9, lr=[4.3687708171564925e-05, 4.3687708171564925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:17,780] [INFO] [timer.py:215:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=128.17389331807945, CurrSamplesPerSec=127.53271798847986, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:20,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=9, lr=[4.354526190122709e-05, 4.354526190122709e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:20,292] [INFO] [timer.py:215:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=128.1674302975097, CurrSamplesPerSec=127.39411691564719, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:22,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=9, lr=[4.340146407365521e-05, 4.340146407365521e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:22,802] [INFO] [timer.py:215:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=128.16183615758715, CurrSamplesPerSec=127.50182440320974, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:25,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=9, lr=[4.3256325168668596e-05, 4.3256325168668596e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:25,310] [INFO] [timer.py:215:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=128.15784808777184, CurrSamplesPerSec=128.4201981543231, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:27,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=9, lr=[4.310985576382276e-05, 4.310985576382276e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:27,820] [INFO] [timer.py:215:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=128.1531796068461, CurrSamplesPerSec=127.60862223709631, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:30,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=9, lr=[4.296206653363848e-05, 4.296206653363848e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:30,330] [INFO] [timer.py:215:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=128.14845002355648, CurrSamplesPerSec=128.18789748996934, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:32,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=9, lr=[4.2812968248823894e-05, 4.2812968248823894e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:32,843] [INFO] [timer.py:215:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=128.14203984896605, CurrSamplesPerSec=127.86233084533274, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:35,080] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:36:35,082] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[editor: 8-way rank duplicates collapsed]
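[editor's note] The lr=[...] values trace the cosine schedule requested on the command line (--lr_scheduler_type cosine, --num_warmup_steps 0, base lr 5e-5), and they advance only on non-skipped steps. A short sketch of the standard cosine formula, lr(t) = 0.5 * lr0 * (1 + cos(pi * t / T)); the total update count T is not printed in this excerpt, so the value below is inferred by inverting the formula and should be treated as an assumption:

import math

def cosine_lr(step, total_steps, base_lr=5e-05):
    # Cosine decay with zero warmup, as configured for this run.
    progress = step / max(1, total_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Using effective updates (global_step - skipped) and an inferred T ~= 3680:
print(cosine_lr(360 - 8, 3680))    # ~4.8880e-05 vs. logged 4.88797e-05 at step=360
print(cosine_lr(1440 - 13, 3680))  # ~3.3633e-05 vs. logged 3.36329e-05 at step=1440 (later in this log)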
[2023-06-12 07:36:35,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=9, lr=[4.2662571775489523e-05, 4.2662571775489523e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:35,354] [INFO] [timer.py:215:stop] epoch=0/micro_step=930/global_step=930, RunningAvgSamplesPerSec=128.1367082004806, CurrSamplesPerSec=127.28381489953304, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:37,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=9, lr=[4.251088807435636e-05, 4.251088807435636e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:37,864] [INFO] [timer.py:215:stop] epoch=0/micro_step=940/global_step=940, RunningAvgSamplesPerSec=128.13219029713017, CurrSamplesPerSec=127.73346631593773, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:39,586] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 946
[2023-06-12 07:36:39,587] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:36:39,587] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:36:40,332] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=10, lr=[4.2373281298214366e-05, 4.2373281298214366e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:40,341] [INFO] [timer.py:215:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=128.14558216615114, CurrSamplesPerSec=128.01001054846284, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:42,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=10, lr=[4.221918239638724e-05, 4.221918239638724e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:42,852] [INFO] [timer.py:215:stop] epoch=0/micro_step=960/global_step=960, RunningAvgSamplesPerSec=128.14033730529505, CurrSamplesPerSec=127.39351233142996, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:45,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=10, lr=[4.206382858046636e-05, 4.206382858046636e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:45,363] [INFO] [timer.py:215:stop] epoch=0/micro_step=970/global_step=970, RunningAvgSamplesPerSec=128.13558678828747, CurrSamplesPerSec=127.98693980926552, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:47,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=10, lr=[4.190723117245809e-05, 4.190723117245809e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:47,877] [INFO] [timer.py:215:stop] epoch=0/micro_step=980/global_step=980, RunningAvgSamplesPerSec=128.12917327451635, CurrSamplesPerSec=128.0449376459634, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:50,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=10, lr=[4.174940158500041e-05, 4.174940158500041e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:50,389] [INFO] [timer.py:215:stop] epoch=0/micro_step=990/global_step=990, RunningAvgSamplesPerSec=128.1238394426681, CurrSamplesPerSec=128.03260107468313, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:52,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=10, lr=[4.1590351320531064e-05, 4.1590351320531064e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:52,900] [INFO] [timer.py:215:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=128.11886230495972, CurrSamplesPerSec=127.76313444770209, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:54,873] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1007
[2023-06-12 07:36:54,874] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:36:54,874] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:36:55,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=11, lr=[4.144617198213059e-05, 4.144617198213059e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:55,378] [INFO] [timer.py:215:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=128.13123745018407, CurrSamplesPerSec=126.9736285442911, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:36:57,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=11, lr=[4.128483443849015e-05, 4.128483443849015e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:36:57,887] [INFO] [timer.py:215:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=128.1274498289982, CurrSamplesPerSec=128.18140908344253, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:00,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=11, lr=[4.1122310074954256e-05, 4.1122310074954256e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:00,397] [INFO] [timer.py:215:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=128.12353718825966, CurrSamplesPerSec=127.2963697822021, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:02,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=11, lr=[4.095861073611052e-05, 4.095861073611052e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:02,904] [INFO] [timer.py:215:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=128.12089818801556, CurrSamplesPerSec=128.31154746995784, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:05,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=11, lr=[4.079374835217739e-05, 4.079374835217739e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:05,412] [INFO] [timer.py:215:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=128.11831630758965, CurrSamplesPerSec=127.54629166801291, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:07,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=11, lr=[4.062773493813468e-05, 4.062773493813468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:07,919] [INFO] [timer.py:215:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=128.1156902874788, CurrSamplesPerSec=127.75936437164879, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:10,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=11, lr=[4.046058259284796e-05, 4.046058259284796e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:10,430] [INFO] [timer.py:215:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=128.11119192130886, CurrSamplesPerSec=127.44952117785026, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
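[editor's note] MemAllocated and MaxMemAllocated hold steady at 4.32GB/12.79GB for the whole excerpt, i.e. the per-GPU memory footprint reached steady state early in the epoch. These figures presumably mirror the torch.cuda caching-allocator counters; a minimal sketch of reading the same numbers (an assumption about the source of the metric, not DeepSpeed's timer code):

import torch

def report_gpu_memory(tag: str) -> None:
    # Current vs. peak tensor memory held on the default CUDA device, in GiB.
    gib = 1024 ** 3
    cur = torch.cuda.memory_allocated() / gib
    peak = torch.cuda.max_memory_allocated() / gib
    print(f"{tag}: MemAllocated={cur:.2f}GB, MaxMemAllocated={peak:.2f}GB")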
[2023-06-12 07:37:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=11, lr=[4.0292303498186814e-05, 4.0292303498186814e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:12,938] [INFO] [timer.py:215:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=128.1086252955196, CurrSamplesPerSec=127.83334714360684, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:15,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=11, lr=[4.012290991813698e-05, 4.012290991813698e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:15,442] [INFO] [timer.py:215:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=128.10793591000052, CurrSamplesPerSec=128.9187794099143, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:17,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=11, lr=[3.995241419790661e-05, 3.995241419790661e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:17,951] [INFO] [timer.py:215:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=128.10443613580182, CurrSamplesPerSec=128.32050265928714, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:20,186] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:37:20,187] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:37:20,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=11, lr=[3.978082876302658e-05, 3.978082876302658e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:20,459] [INFO] [timer.py:215:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=128.10202837132144, CurrSamplesPerSec=127.6826734095523, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:22,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=11, lr=[3.9608166118444864e-05, 3.9608166118444864e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:22,970] [INFO] [timer.py:215:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=128.0980853364955, CurrSamplesPerSec=126.97422914969722, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:25,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=11, lr=[3.94344388476153e-05, 3.94344388476153e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:25,476] [INFO] [timer.py:215:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=128.09640421019526, CurrSamplesPerSec=127.90278385283402, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:27,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=11, lr=[3.925965961158039e-05, 3.925965961158039e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:27,984] [INFO] [timer.py:215:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=128.09360773309078, CurrSamplesPerSec=127.65522803707029, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:30,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=11, lr=[3.908384114804867e-05, 3.908384114804867e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:30,495] [INFO] [timer.py:215:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=128.08979377174293, CurrSamplesPerSec=127.96900164945701, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:32,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=11, lr=[3.890699627046639e-05, 3.890699627046639e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:33,002] [INFO] [timer.py:215:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=128.08783954523238, CurrSamplesPerSec=127.55077646495336, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:35,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=11, lr=[3.872913786708364e-05, 3.872913786708364e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:35,511] [INFO] [timer.py:215:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=128.08525628738477, CurrSamplesPerSec=127.49370974789596, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:36,230] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1172
[2023-06-12 07:37:36,231] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:37:36,231] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:37:37,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=12, lr=[3.856820945115655e-05, 3.856820945115655e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:37,988] [INFO] [timer.py:215:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=128.0961629767614, CurrSamplesPerSec=128.1045775588898, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:40,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=12, lr=[3.83884611196668e-05, 3.83884611196668e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:40,499] [INFO] [timer.py:215:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=128.09278505785804, CurrSamplesPerSec=128.2882453021353, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:43,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=12, lr=[3.8207737052618545e-05, 3.8207737052618545e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:43,015] [INFO] [timer.py:215:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=128.08689649145865, CurrSamplesPerSec=127.12708756402873, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:45,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=12, lr=[3.80260504209727e-05, 3.80260504209727e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:45,525] [INFO] [timer.py:215:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=128.0837623724224, CurrSamplesPerSec=127.99108948752156, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:48,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=12, lr=[3.784341446584082e-05, 3.784341446584082e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:48,035] [INFO] [timer.py:215:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=128.080630595365, CurrSamplesPerSec=127.5270227513504, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:50,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=12, lr=[3.765984249752004e-05, 3.765984249752004e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:50,548] [INFO] [timer.py:215:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=128.07615990540486, CurrSamplesPerSec=128.02185431663767, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
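[editor's note] Because eight ranks write to the same console, the step/lr/throughput series is easiest to inspect after parsing the log. A small self-contained Python sketch, with regexes written against the line format shown above (the file name is hypothetical):

import re

STEP_RE = re.compile(r"step=(\d+), skipped=(\d+), lr=\[([0-9.e+-]+)")
TIMER_RE = re.compile(r"global_step=(\d+), RunningAvgSamplesPerSec=([0-9.]+)")

def parse_training_log(path="training.log"):
    """Yield (global_step, skipped, lr, running_avg_samples_per_sec) tuples."""
    pending = {}
    with open(path) as f:
        for line in f:
            if m := STEP_RE.search(line):
                pending = {"step": int(m.group(1)),
                           "skipped": int(m.group(2)),
                           "lr": float(m.group(3))}
            elif (m := TIMER_RE.search(line)) and pending.get("step") == int(m.group(1)):
                yield (pending["step"], pending["skipped"],
                       pending["lr"], float(m.group(2)))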
[2023-06-12 07:37:53,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=12, lr=[3.747534789452304e-05, 3.747534789452304e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:53,060] [INFO] [timer.py:215:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=128.072433295388, CurrSamplesPerSec=127.87183256528272, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:55,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=12, lr=[3.728994410260308e-05, 3.728994410260308e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:55,570] [INFO] [timer.py:215:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=128.06959324640667, CurrSamplesPerSec=127.9232638044996, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:37:58,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=12, lr=[3.7103644633774014e-05, 3.7103644633774014e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:37:58,081] [INFO] [timer.py:215:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=128.06627112015954, CurrSamplesPerSec=126.90459680019515, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:00,583] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=12, lr=[3.691646306532564e-05, 3.691646306532564e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:00,592] [INFO] [timer.py:215:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=128.0632752381575, CurrSamplesPerSec=128.22573920686327, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:01,575] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:38:01,576] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:38:03,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=12, lr=[3.672841303883413e-05, 3.672841303883413e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:03,104] [INFO] [timer.py:215:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=128.05958409059627, CurrSamplesPerSec=127.58339163498098, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:05,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=12, lr=[3.6539508259167863e-05, 3.6539508259167863e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:05,619] [INFO] [timer.py:215:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=128.05502779010172, CurrSamplesPerSec=127.81776470352958, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:08,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=12, lr=[3.634976249348867e-05, 3.634976249348867e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:08,125] [INFO] [timer.py:215:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=128.0539067673024, CurrSamplesPerSec=127.46622448756703, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:10,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=12, lr=[3.615918957024845e-05, 3.615918957024845e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:10,636] [INFO] [timer.py:215:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=128.05069035183686, CurrSamplesPerSec=127.95728964674343, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:13,135] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=12, lr=[3.5967803378181386e-05, 3.5967803378181386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:13,144] [INFO] [timer.py:215:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=128.0491720961679, CurrSamplesPerSec=128.09406321399192, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:15,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=12, lr=[3.577561786529177e-05, 3.577561786529177e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:15,656] [INFO] [timer.py:215:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=128.04584756774636, CurrSamplesPerSec=127.55586769774963, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1330
[2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[editor: 8-way rank duplicates collapsed]
[2023-06-12 07:38:15,872] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
Attempted loss scale: 16384.0, reducing to 8192.0 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1330 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1330 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1330 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1330 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-12 07:38:15,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-12 07:38:18,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=13, lr=[3.560197905937272e-05, 3.560197905937272e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:18,133] [INFO] [timer.py:215:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=128.055812100271, CurrSamplesPerSec=126.93124015869037, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:20,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=13, lr=[3.5408313471484715e-05, 3.5408313471484715e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:20,645] [INFO] [timer.py:215:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=128.0528605171604, CurrSamplesPerSec=127.37646541882057, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:23,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=13, lr=[3.521388933775134e-05, 3.521388933775134e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:23,156] [INFO] [timer.py:215:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=128.0496495522943, CurrSamplesPerSec=127.58242142645302, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:25,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=13, lr=[3.5018720827578524e-05, 3.5018720827578524e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:25,668] [INFO] [timer.py:215:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=128.04659186369958, CurrSamplesPerSec=127.69020473399802, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:28,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=13, lr=[3.4822822164621546e-05, 3.4822822164621546e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:28,176] [INFO] [timer.py:215:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=128.0449536024872, CurrSamplesPerSec=127.81411311864288, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:30,681] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=13, lr=[3.462620762574832e-05, 3.462620762574832e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:38:30,690] [INFO] [timer.py:215:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=128.040906447566, CurrSamplesPerSec=127.93948175149085, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:38:33,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=13, 
[2023-06-12 07:38:33,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=13, lr=[3.442889153999901e-05, 3.442889153999901e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:33,203] [INFO] [timer.py:215:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=128.03758458924017, CurrSamplesPerSec=127.8893779127437, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:35,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=13, lr=[3.423088828754168e-05, 3.423088828754168e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:35,712] [INFO] [timer.py:215:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=128.03561672556455, CurrSamplesPerSec=128.03284534015322, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:38,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=13, lr=[3.4032212298624314e-05, 3.4032212298624314e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:38,220] [INFO] [timer.py:215:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=128.03393007614144, CurrSamplesPerSec=128.22047186925914, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:40,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=13, lr=[3.383287805252317e-05, 3.383287805252317e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:40,730] [INFO] [timer.py:215:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=128.03171151492984, CurrSamplesPerSec=127.30433858860711, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:41,210] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:38:41,211] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:38:43,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=13, lr=[3.36329000764875e-05, 3.36329000764875e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:43,242] [INFO] [timer.py:215:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=128.02857872718292, CurrSamplesPerSec=126.91299665171712, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:45,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=13, lr=[3.343229294468086e-05, 3.343229294468086e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:45,752] [INFO] [timer.py:215:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=128.02635980080208, CurrSamplesPerSec=127.73091355510404, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:48,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=13, lr=[3.323107127711897e-05, 3.323107127711897e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:48,260] [INFO] [timer.py:215:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=128.02481849074564, CurrSamplesPerSec=127.96900164945701, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:50,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=13, lr=[3.302924973860416e-05, 3.302924973860416e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:50,770] [INFO] [timer.py:215:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=128.02267882749456, CurrSamplesPerSec=128.42548191468393, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=13, lr=[3.282684303765669e-05, 3.282684303765669e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:53,283] [INFO] [timer.py:215:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=128.01956050282269, CurrSamplesPerSec=127.57272015268684, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:55,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=13, lr=[3.2623865925442816e-05, 3.2623865925442816e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:55,791] [INFO] [timer.py:215:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=128.01798402487617, CurrSamplesPerSec=127.41540193680005, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:38:58,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=13, lr=[3.242033319469963e-05, 3.242033319469963e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:38:58,299] [INFO] [timer.py:215:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=128.01676885547505, CurrSamplesPerSec=128.48719892781926, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:00,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=13, lr=[3.221625967865712e-05, 3.221625967865712e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:00,806] [INFO] [timer.py:215:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=128.01553598030188, CurrSamplesPerSec=128.3371896337901, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:03,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=13, lr=[3.201166024995706e-05, 3.201166024995706e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:03,311] [INFO] [timer.py:215:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=128.0153895447443, CurrSamplesPerSec=128.11338159381776, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:05,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=13, lr=[3.180654981956912e-05, 3.180654981956912e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:05,817] [INFO] [timer.py:215:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=128.01462318533956, CurrSamplesPerSec=128.11533821036394, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
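The timer lines track throughput two ways: CurrSamplesPerSec for the step that just finished and RunningAvgSamplesPerSec over all timed steps so far. From the timestamps, ten micro steps take roughly 2.5 s, so at ~128 samples/sec each step processes about 32 samples. A rough sketch of that bookkeeping; the field names come from the log, the logic is assumed:

```python
import time

# Assumed throughput bookkeeping behind the timer.py lines: current
# samples/sec from the last step's wall time, running average over all steps.
class ThroughputTimer:
    def __init__(self, samples_per_step: int):
        self.samples_per_step = samples_per_step
        self.total_time = 0.0
        self.steps = 0
        self._t0 = None

    def start(self):
        self._t0 = time.time()

    def stop(self):
        elapsed = time.time() - self._t0
        self.total_time += elapsed
        self.steps += 1
        curr = self.samples_per_step / elapsed  # CurrSamplesPerSec
        avg = self.samples_per_step * self.steps / self.total_time  # RunningAvg
        return curr, avg
```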
[2023-06-12 07:39:06,297] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:39:06,298] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:39:07,294] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1535
[2023-06-12 07:39:07,295] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:39:07,296] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:39:08,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=14, lr=[3.1621525879721206e-05, 3.1621525879721206e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:08,299] [INFO] [timer.py:215:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=128.02190124005665, CurrSamplesPerSec=127.72131122581038, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:10,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=14, lr=[3.1415485758349346e-05, 3.1415485758349346e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:10,810] [INFO] [timer.py:215:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=128.01959591555064, CurrSamplesPerSec=129.09758854088014, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:13,036] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1558
[2023-06-12 07:39:13,038] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:39:13,038] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:39:13,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=15, lr=[3.122964946248119e-05, 3.122964946248119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:13,290] [INFO] [timer.py:215:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=128.02739920968037, CurrSamplesPerSec=127.17466772663143, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:15,789] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=15, lr=[3.102273385690231e-05, 3.102273385690231e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:15,798] [INFO] [timer.py:215:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=128.02605166485563, CurrSamplesPerSec=127.60947151746792, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:18,299] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=15, lr=[3.08153793214471e-05, 3.08153793214471e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:18,309] [INFO] [timer.py:215:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=128.0239027046948, CurrSamplesPerSec=127.93606710513774, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:20,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=15, lr=[3.0607600967874206e-05, 3.0607600967874206e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:20,818] [INFO] [timer.py:215:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=128.02218666514744, CurrSamplesPerSec=127.55550402478545, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:23,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=15, lr=[3.039941393882969e-05, 3.039941393882969e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:23,328] [INFO] [timer.py:215:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=128.02011053823924, CurrSamplesPerSec=127.61626616742525, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:25,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=15, lr=[3.0190833406743398e-05, 3.0190833406743398e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:25,839] [INFO] [timer.py:215:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=128.0178990883948, CurrSamplesPerSec=127.2088992765594, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:28,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=15, lr=[2.9981874572723222e-05, 2.9981874572723222e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:28,349] [INFO] [timer.py:215:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=128.0158363194406, CurrSamplesPerSec=127.82080785032275, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:30,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=15, lr=[2.9772552665447263e-05, 2.9772552665447263e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:30,856] [INFO] [timer.py:215:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=128.01484766155724, CurrSamplesPerSec=128.17541095229774, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:33,357] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=15, lr=[2.9562882940053975e-05, 2.9562882940053975e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:33,366] [INFO] [timer.py:215:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=128.01300963543733, CurrSamplesPerSec=127.5270227513504, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:35,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=15, lr=[2.9352880677030386e-05, 2.9352880677030386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:35,876] [INFO] [timer.py:215:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=128.01130887282926, CurrSamplesPerSec=127.86671607906874, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:38,365] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:39:38,366] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:39:38,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=15, lr=[2.9142561181098505e-05, 2.9142561181098505e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:38,387] [INFO] [timer.py:215:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=128.0090507031761, CurrSamplesPerSec=127.28478057181441, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:40,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=15, lr=[2.89319397800999e-05, 2.89319397800999e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:40,904] [INFO] [timer.py:215:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=128.0049951247616, CurrSamplesPerSec=127.60947151746792, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:43,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=15, lr=[2.872103182387862e-05, 2.872103182387862e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:43,416] [INFO] [timer.py:215:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=128.00260288589115, CurrSamplesPerSec=128.11790636028067, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:45,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=15, lr=[2.8509852683162536e-05, 2.8509852683162536e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:45,929] [INFO] [timer.py:215:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=128.0000851166193, CurrSamplesPerSec=127.57635795745101, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:48,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=15, lr=[2.8298417748443116e-05, 2.8298417748443116e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:48,441] [INFO] [timer.py:215:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=127.99784957131214, CurrSamplesPerSec=127.41552289474134, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:50,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=15, lr=[2.8086742428853836e-05, 2.8086742428853836e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:50,952] [INFO] [timer.py:215:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=127.99564664044512, CurrSamplesPerSec=127.53053677373256, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:53,455] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=15, lr=[2.7874842151047114e-05, 2.7874842151047114e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:53,464] [INFO] [timer.py:215:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=127.99340847152818, CurrSamplesPerSec=126.9660614029694, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:55,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=15, lr=[2.766273235807006e-05, 2.766273235807006e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:55,979] [INFO] [timer.py:215:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=127.99030342095347, CurrSamplesPerSec=127.37235549844696, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:39:58,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=15, lr=[2.7450428508239024e-05, 2.7450428508239024e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:39:58,491] [INFO] [timer.py:215:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=127.9881257251868, CurrSamplesPerSec=128.11668341890817, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:00,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=15, lr=[2.723794607401297e-05, 2.723794607401297e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:01,003] [INFO] [timer.py:215:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=127.9859074587052, CurrSamplesPerSec=127.83188612845312, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
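Between overflow events the learning rate decays smoothly from one logged step to the next, consistent with cosine annealing toward zero. A generic sketch that reproduces the shape of these values; the peak LR of 5e-5 and the total-step count below are assumptions fitted to the logged numbers, not read from these lines:

```python
import math

# Cosine decay with optional linear warmup; peak_lr and total_steps are
# assumed values that approximately reproduce the lr trace in this log.
def cosine_lr(step: int, peak_lr: float = 5e-5, total_steps: int = 3710,
              warmup_steps: int = 0) -> float:
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(1280))  # ~3.67e-05, close to the lr logged at step=1280
print(cosine_lr(2260))  # ~1.66e-05, close to the lr logged at step=2260
```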
[2023-06-12 07:40:03,493] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:40:03,494] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:40:03,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=15, lr=[2.7025300540865923e-05, 2.7025300540865923e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:03,514] [INFO] [timer.py:215:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=127.98394740646945, CurrSamplesPerSec=127.18286233562934, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:03,738] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1760
[2023-06-12 07:40:03,738] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:40:03,739] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:40:05,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=16, lr=[2.6833792919797152e-05, 2.6833792919797152e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:05,999] [INFO] [timer.py:215:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=127.98976375922193, CurrSamplesPerSec=127.46101936353881, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:08,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=16, lr=[2.6620880202842324e-05, 2.6620880202842324e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:08,506] [INFO] [timer.py:215:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=127.9889938155736, CurrSamplesPerSec=127.24061605901971, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:11,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=16, lr=[2.6407849358013358e-05, 2.6407849358013358e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:11,021] [INFO] [timer.py:215:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=127.98586912236219, CurrSamplesPerSec=127.06715113692091, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:13,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=16, lr=[2.6194715910751803e-05, 2.6194715910751803e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:13,532] [INFO] [timer.py:215:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=127.98419696887774, CurrSamplesPerSec=128.00781297686626, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:16,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=16, lr=[2.598149539397672e-05, 2.598149539397672e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:16,044] [INFO] [timer.py:215:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=127.98218899027498, CurrSamplesPerSec=127.46259296085364, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:18,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=16, lr=[2.576820334695273e-05, 2.576820334695273e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:18,554] [INFO] [timer.py:215:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=127.98065290013902, CurrSamplesPerSec=127.3195545352786, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:21,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=16, lr=[2.5554855314157468e-05, 2.5554855314157468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:21,068] [INFO] [timer.py:215:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=127.97801155570318, CurrSamplesPerSec=126.86369341983878, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:23,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=16, lr=[2.5341466844148775e-05, 2.5341466844148775e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:23,579] [INFO] [timer.py:215:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=127.97610936228, CurrSamplesPerSec=127.60292021877873, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:26,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=16, lr=[2.5128053488431502e-05, 2.5128053488431502e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:26,088] [INFO] [timer.py:215:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=127.9749128631184, CurrSamplesPerSec=128.07988380814604, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:28,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=16, lr=[2.4914630800324114e-05, 2.4914630800324114e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:28,600] [INFO] [timer.py:215:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=127.97290021634043, CurrSamplesPerSec=126.6620815135535, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:29,083] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:40:29,084] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:40:29,580] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1863
[2023-06-12 07:40:29,581] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:40:29,581] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:40:31,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=17, lr=[2.472255525721801e-05, 2.472255525721801e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:31,087] [INFO] [timer.py:215:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=127.97781932937943, CurrSamplesPerSec=127.72374205876419, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:33,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=17, lr=[2.4509157688484295e-05, 2.4509157688484295e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:33,596] [INFO] [timer.py:215:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=127.97675106399706, CurrSamplesPerSec=127.58933448421077, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:36,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=17, lr=[2.4295795891770427e-05, 2.4295795891770427e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:36,109] [INFO] [timer.py:215:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=127.9742864775827, CurrSamplesPerSec=127.54701891095695, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:38,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=17, lr=[2.408248541663735e-05, 2.408248541663735e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:38,623] [INFO] [timer.py:215:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=127.97192220641512, CurrSamplesPerSec=127.32897575376956, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:41,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=17, lr=[2.386924180890572e-05, 2.386924180890572e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:41,132] [INFO] [timer.py:215:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=127.97079360206841, CurrSamplesPerSec=127.90717186185182, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:43,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=17, lr=[2.3656080609522975e-05, 2.3656080609522975e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:43,642] [INFO] [timer.py:215:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=127.9695497479756, CurrSamplesPerSec=128.14273711867773, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:46,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=17, lr=[2.3443017353430764e-05, 2.3443017353430764e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:46,153] [INFO] [timer.py:215:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=127.96782046795583, CurrSamplesPerSec=127.90558726830895, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:48,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=17, lr=[2.3230067568432687e-05, 2.3230067568432687e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:48,663] [INFO] [timer.py:215:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=127.96649297722442, CurrSamplesPerSec=127.75535130936699, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:51,161] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=17, lr=[2.301724677406277e-05, 2.301724677406277e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:51,170] [INFO] [timer.py:215:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=127.96594524180867, CurrSamplesPerSec=127.3443184136246, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:53,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=17, lr=[2.280457048045429e-05, 2.280457048045429e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:53,684] [INFO] [timer.py:215:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=127.96365830567592, CurrSamplesPerSec=127.54144359364696, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:54,922] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:40:54,923] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:40:55,168] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1965
[2023-06-12 07:40:55,169] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:40:55,169] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
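By this point the pattern is clear: every ~100 clean iterations the scaler probes 32768.0, overflows within a few steps, and falls back to 16384.0, so the effective scale hovers at 16384.0. The probes are cheap; counting from the step=1970, skipped=18 record that follows:

```python
# Fraction of optimizer updates lost to overflow skips so far in this run,
# using the step and skipped counters from the adjacent log record.
steps, skipped = 1970, 18
print(f"{skipped / steps:.2%} of steps skipped")  # ~0.91%
```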
[2023-06-12 07:40:56,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=18, lr=[2.261329817501475e-05, 2.261329817501475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:56,173] [INFO] [timer.py:215:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=127.96797590807468, CurrSamplesPerSec=127.84077448703759, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:40:58,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=18, lr=[2.240093912464302e-05, 2.240093912464302e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:40:58,684] [INFO] [timer.py:215:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=127.96626928620911, CurrSamplesPerSec=127.60680238865115, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:01,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=18, lr=[2.218876949082127e-05, 2.218876949082127e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:01,197] [INFO] [timer.py:215:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=127.96435508196535, CurrSamplesPerSec=127.37380602829374, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:03,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=18, lr=[2.197680473622697e-05, 2.197680473622697e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:03,708] [INFO] [timer.py:215:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=127.96279629489702, CurrSamplesPerSec=126.87340708412147, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:06,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=18, lr=[2.1765060308606246e-05, 2.1765060308606246e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:06,221] [INFO] [timer.py:215:stop] epoch=0/micro_step=2010/global_step=2010, RunningAvgSamplesPerSec=127.96080050720353, CurrSamplesPerSec=127.31279144328215, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:08,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=18, lr=[2.1553551639648015e-05, 2.1553551639648015e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:08,730] [INFO] [timer.py:215:stop] epoch=0/micro_step=2020/global_step=2020, RunningAvgSamplesPerSec=127.9598035855475, CurrSamplesPerSec=127.40935433253246, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:11,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=18, lr=[2.1342294143859416e-05, 2.1342294143859416e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:11,243] [INFO] [timer.py:215:stop] epoch=0/micro_step=2030/global_step=2030, RunningAvgSamplesPerSec=127.95767612643242, CurrSamplesPerSec=127.7106166599426, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:13,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=18, lr=[2.1131303217442347e-05, 2.1131303217442347e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:13,755] [INFO] [timer.py:215:stop] epoch=0/micro_step=2040/global_step=2040, RunningAvgSamplesPerSec=127.95616404302373, CurrSamplesPerSec=127.32184931903191, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:16,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=18, lr=[2.092059423717145e-05, 2.092059423717145e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:16,267] [INFO] [timer.py:215:stop] epoch=0/micro_step=2050/global_step=2050, RunningAvgSamplesPerSec=127.9542541140684, CurrSamplesPerSec=127.61663018694983, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:18,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=18, lr=[2.0710182559273457e-05, 2.0710182559273457e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:18,779] [INFO] [timer.py:215:stop] epoch=0/micro_step=2060/global_step=2060, RunningAvgSamplesPerSec=127.95262418743026, CurrSamplesPerSec=127.2794695555859, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:20,517] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:41:20,518] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:41:20,764] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2067
[2023-06-12 07:41:20,764] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:41:20,764] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:41:21,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=19, lr=[2.052107891717339e-05, 2.052107891717339e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:21,266] [INFO] [timer.py:215:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=127.95708082669086, CurrSamplesPerSec=127.8349299477108, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:23,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=19, lr=[2.0311274341687408e-05, 2.0311274341687408e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:23,776] [INFO] [timer.py:215:stop] epoch=0/micro_step=2080/global_step=2080, RunningAvgSamplesPerSec=127.95606859567125, CurrSamplesPerSec=126.9831187740841, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:26,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=19, lr=[2.0101811475103458e-05, 2.0101811475103458e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:26,284] [INFO] [timer.py:215:stop] epoch=0/micro_step=2090/global_step=2090, RunningAvgSamplesPerSec=127.95531244114856, CurrSamplesPerSec=127.5664151158591, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:28,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=19, lr=[1.9892705582832933e-05, 1.9892705582832933e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:28,797] [INFO] [timer.py:215:stop] epoch=0/micro_step=2100/global_step=2100, RunningAvgSamplesPerSec=127.95344641526509, CurrSamplesPerSec=127.59709740685074, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:31,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=19, lr=[1.9683971904271375e-05, 1.9683971904271375e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:31,310] [INFO] [timer.py:215:stop] epoch=0/micro_step=2110/global_step=2110, RunningAvgSamplesPerSec=127.95152130347918, CurrSamplesPerSec=127.17623426141402, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:33,816] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=19, lr=[1.947562565168781e-05, 1.947562565168781e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:33,825] [INFO] [timer.py:215:stop] epoch=0/micro_step=2120/global_step=2120, RunningAvgSamplesPerSec=127.9490116425704, CurrSamplesPerSec=127.3644990358775, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:36,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=19, lr=[1.9267682009116103e-05, 1.9267682009116103e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:36,337] [INFO] [timer.py:215:stop] epoch=0/micro_step=2130/global_step=2130, RunningAvgSamplesPerSec=127.9474658835021, CurrSamplesPerSec=127.60777296802907, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
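For post-hoc analysis it is easier to extract these events from the raw log than to read them inline. A small sketch; the regexes match the exact message formats in this log, and the filename is a placeholder:

```python
import re

# Pull loss-scale changes and per-step throughput out of a raw DeepSpeed log
# formatted like the lines above. "training.log" is a placeholder path.
scale_re = re.compile(
    r"(Reducing|Increasing) dynamic loss scale from ([\d.]+) to ([\d.]+)")
timer_re = re.compile(r"global_step=(\d+), RunningAvgSamplesPerSec=([\d.]+)")

with open("training.log") as f:
    for line in f:
        if (m := scale_re.search(line)):
            print("scale", m.group(1).lower(), m.group(2), "->", m.group(3))
        elif (m := timer_re.search(line)):
            print("step", m.group(1), "avg samples/sec", m.group(2))
```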
07:41:38,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=19, lr=[1.906015613124839e-05, 1.906015613124839e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:41:38,848] [INFO] [timer.py:215:stop] epoch=0/micro_step=2140/global_step=2140, RunningAvgSamplesPerSec=127.94610718955609, CurrSamplesPerSec=127.55974700507035, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:41:41,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=19, lr=[1.8853063142330564e-05, 1.8853063142330564e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:41:41,359] [INFO] [timer.py:215:stop] epoch=0/micro_step=2150/global_step=2150, RunningAvgSamplesPerSec=127.94478553377729, CurrSamplesPerSec=128.05654352154778, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:41:43,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=19, lr=[1.8646418135060102e-05, 1.8646418135060102e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:41:43,873] [INFO] [timer.py:215:stop] epoch=0/micro_step=2160/global_step=2160, RunningAvgSamplesPerSec=127.94280242462052, CurrSamplesPerSec=127.37054238259638, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:41:46,112] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,113] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-12 07:41:46,114] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,114] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-06-12 07:41:46,375] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=19, lr=[1.844023616948608e-05, 1.844023616948608e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-12 07:41:46,384] [INFO] [timer.py:215:stop] epoch=0/micro_step=2170/global_step=2170, RunningAvgSamplesPerSec=127.94144284801101, CurrSamplesPerSec=127.95838755592673, MemAllocated=4.32GB, MaxMemAllocated=12.79GB [2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 
[2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2170 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-12 07:41:46,611] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
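The block above captures one full cycle of fp16 dynamic loss scaling: a step whose gradients overflow is skipped and the scale is halved, while 100 consecutive overflow-free iterations double it again. A minimal sketch of that update rule, mirroring the behavior in this log (illustrative only: the class and argument names here are assumptions, not DeepSpeed's actual fused_optimizer API):

class DynamicLossScaler:
    # Halve the scale on overflow, double it after `scale_window` clean
    # steps -- the rule visible in the log above. Names and defaults are
    # assumptions for illustration.
    def __init__(self, init_scale=32768.0, scale_window=100, min_scale=1.0):
        self.scale = init_scale
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0  # iterations since the last overflow

    def update(self, overflow: bool) -> bool:
        """Update the scale; return True if the optimizer step should run."""
        if overflow:
            # "Reducing dynamic loss scale from 32768.0 to 16384.0"
            self.scale = max(self.scale / 2.0, self.min_scale)
            self.good_steps = 0
            return False  # "Overflow detected. Skipping step."
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            # "No Grad overflow for 100 iterations" -> double the scale
            self.scale *= 2.0
        return True

With scale_window=100 this reproduces the cadence in the log exactly: every doubling follows 100 clean iterations, and every overflow resets the count and halves the scale.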
[2023-06-12 07:41:48,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=20, lr=[1.825508072107439e-05, 1.825508072107439e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:48,871] [INFO] [timer.py:215:stop] epoch=0/micro_step=2180/global_step=2180, RunningAvgSamplesPerSec=127.94576083373323, CurrSamplesPerSec=128.050190571141, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:51,375] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=20, lr=[1.8049819903415228e-05, 1.8049819903415228e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:51,385] [INFO] [timer.py:215:stop] epoch=0/micro_step=2190/global_step=2190, RunningAvgSamplesPerSec=127.94387372821984, CurrSamplesPerSec=128.15790941139176, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:53,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=20, lr=[1.7845065606841472e-05, 1.7845065606841472e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:53,893] [INFO] [timer.py:215:stop] epoch=0/micro_step=2200/global_step=2200, RunningAvgSamplesPerSec=127.9432796111559, CurrSamplesPerSec=127.48438522554252, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:56,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=20, lr=[1.76408327536094e-05, 1.76408327536094e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:56,405] [INFO] [timer.py:215:stop] epoch=0/micro_step=2210/global_step=2210, RunningAvgSamplesPerSec=127.94174684998022, CurrSamplesPerSec=127.31411985107543, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:41:58,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=20, lr=[1.743713622797311e-05, 1.743713622797311e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:41:58,915] [INFO] [timer.py:215:stop] epoch=0/micro_step=2220/global_step=2220, RunningAvgSamplesPerSec=127.94062801695311, CurrSamplesPerSec=127.3644990358775, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:01,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=20, lr=[1.723399087509974e-05, 1.723399087509974e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:01,427] [INFO] [timer.py:215:stop] epoch=0/micro_step=2230/global_step=2230, RunningAvgSamplesPerSec=127.93906558444975, CurrSamplesPerSec=127.38347707162998, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:03,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=20, lr=[1.7031411499987605e-05, 1.7031411499987605e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:03,941] [INFO] [timer.py:215:stop] epoch=0/micro_step=2240/global_step=2240, RunningAvgSamplesPerSec=127.9370898887745, CurrSamplesPerSec=127.03552076264248, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:06,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=20, lr=[1.6829412866387228e-05, 1.6829412866387228e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:06,450] [INFO] [timer.py:215:stop] epoch=0/micro_step=2250/global_step=2250, RunningAvgSamplesPerSec=127.93651448591038, CurrSamplesPerSec=128.44514623778161, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:08,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=20, lr=[1.6628009695725346e-05, 1.6628009695725346e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:08,959] [INFO] [timer.py:215:stop] epoch=0/micro_step=2260/global_step=2260, RunningAvgSamplesPerSec=127.93573474530254, CurrSamplesPerSec=127.5608381027689, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:10,680] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2266
[2023-06-12 07:42:10,682] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:42:10,682] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:42:11,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=21, lr=[1.6447268095247876e-05, 1.6447268095247876e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:11,436] [INFO] [timer.py:215:stop] epoch=0/micro_step=2270/global_step=2270, RunningAvgSamplesPerSec=127.94218766985102, CurrSamplesPerSec=127.98193614335135, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:13,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=21, lr=[1.6247036705412644e-05, 1.6247036705412644e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:13,948] [INFO] [timer.py:215:stop] epoch=0/micro_step=2280/global_step=2280, RunningAvgSamplesPerSec=127.94069686865086, CurrSamplesPerSec=127.60279890477639, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:16,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=21, lr=[1.604744322141682e-05, 1.604744322141682e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:16,456] [INFO] [timer.py:215:stop] epoch=0/micro_step=2290/global_step=2290, RunningAvgSamplesPerSec=127.93998311179779, CurrSamplesPerSec=127.75875631570331, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:17,680] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2294
[2023-06-12 07:42:17,681] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-06-12 07:42:17,681] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-06-12 07:42:18,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=22, lr=[1.5868366518677517e-05, 1.5868366518677517e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:18,937] [INFO] [timer.py:215:stop] epoch=0/micro_step=2300/global_step=2300, RunningAvgSamplesPerSec=127.94571677458067, CurrSamplesPerSec=128.19842724554016, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:21,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=22, lr=[1.567002509112022e-05, 1.567002509112022e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:21,446] [INFO] [timer.py:215:stop] epoch=0/micro_step=2310/global_step=2310, RunningAvgSamplesPerSec=127.94487483763727, CurrSamplesPerSec=127.97498047733511, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:23,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=22, lr=[1.5472363621341286e-05, 1.5472363621341286e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:23,960] [INFO] [timer.py:215:stop] epoch=0/micro_step=2320/global_step=2320, RunningAvgSamplesPerSec=127.94287116711554, CurrSamplesPerSec=127.68522424252802, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:26,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=22, lr=[1.5275396514679986e-05, 1.5275396514679986e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:26,474] [INFO] [timer.py:215:stop] epoch=0/micro_step=2330/global_step=2330, RunningAvgSamplesPerSec=127.94092809650165, CurrSamplesPerSec=127.10409321503418, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:28,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=22, lr=[1.5079138125871195e-05, 1.5079138125871195e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:28,986] [INFO] [timer.py:215:stop] epoch=0/micro_step=2340/global_step=2340, RunningAvgSamplesPerSec=127.93967047412413, CurrSamplesPerSec=127.43076543303648, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
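For context on where the overflow check sits: the loss is multiplied by the current scale before backward, gradients are un-scaled before the optimizer step, and any non-finite gradient drops the whole step, which is why the skipped= counter climbs from 20 to 22 across these events while global_step keeps advancing. A hedged PyTorch sketch of that flow on a toy model, not the fused DeepSpeed optimizer itself:

import torch

# Toy model and data purely for illustration.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=5e-5)
scale = 16384.0  # current dynamic loss scale
x, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
(loss * scale).backward()  # backward runs on the scaled loss
overflow = any(not torch.isfinite(p.grad).all() for p in model.parameters())
if overflow:
    scale /= 2.0      # "Reducing dynamic loss scale ..."
    opt.zero_grad()   # "Overflow detected. Skipping step."
else:
    for p in model.parameters():
        p.grad /= scale  # un-scale gradients before stepping
    opt.step()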
[2023-06-12 07:42:31,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=22, lr=[1.4883602757999259e-05, 1.4883602757999259e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:31,499] [INFO] [timer.py:215:stop] epoch=0/micro_step=2350/global_step=2350, RunningAvgSamplesPerSec=127.9379248716125, CurrSamplesPerSec=126.83528049858015, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:34,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=22, lr=[1.468880466145559e-05, 1.468880466145559e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:34,011] [INFO] [timer.py:215:stop] epoch=0/micro_step=2360/global_step=2360, RunningAvgSamplesPerSec=127.93658681016575, CurrSamplesPerSec=127.8181298734939, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:36,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=22, lr=[1.4494758032900119e-05, 1.4494758032900119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:36,518] [INFO] [timer.py:215:stop] epoch=0/micro_step=2370/global_step=2370, RunningAvgSamplesPerSec=127.93609815872871, CurrSamplesPerSec=128.24791576887466, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:39,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=22, lr=[1.4301477014226664e-05, 1.4301477014226664e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:39,028] [INFO] [timer.py:215:stop] epoch=0/micro_step=2380/global_step=2380, RunningAvgSamplesPerSec=127.9351521397032, CurrSamplesPerSec=128.23504997831185, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:41,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=22, lr=[1.4108975691532272e-05, 1.4108975691532272e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:41,539] [INFO] [timer.py:215:stop] epoch=0/micro_step=2390/global_step=2390, RunningAvgSamplesPerSec=127.934059022353, CurrSamplesPerSec=127.43875107530047, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:43,023] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:42:43,024] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-06-12 07:42:44,042] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=22, lr=[1.3917268094090663e-05, 1.3917268094090663e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:44,052] [INFO] [timer.py:215:stop] epoch=0/micro_step=2400/global_step=2400, RunningAvgSamplesPerSec=127.93255866755182, CurrSamplesPerSec=126.79142731096923, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:46,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=22, lr=[1.3726368193329758e-05, 1.3726368193329758e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:46,559] [INFO] [timer.py:215:stop] epoch=0/micro_step=2410/global_step=2410, RunningAvgSamplesPerSec=127.93211119703683, CurrSamplesPerSec=127.76897242382472, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:49,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=22, lr=[1.3536289901813486e-05, 1.3536289901813486e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:49,072] [INFO] [timer.py:215:stop] epoch=0/micro_step=2420/global_step=2420, RunningAvgSamplesPerSec=127.93055843177406, CurrSamplesPerSec=127.18659846372817, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:51,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=22, lr=[1.334704707222787e-05, 1.334704707222787e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:51,586] [INFO] [timer.py:215:stop] epoch=0/micro_step=2430/global_step=2430, RunningAvgSamplesPerSec=127.92883428631895, CurrSamplesPerSec=127.76349930558001, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:54,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=22, lr=[1.3158653496371395e-05, 1.3158653496371395e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:54,093] [INFO] [timer.py:215:stop] epoch=0/micro_step=2440/global_step=2440, RunningAvgSamplesPerSec=127.92847424204895, CurrSamplesPerSec=127.88925605345104, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:56,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=22, lr=[1.2971122904149943e-05, 1.2971122904149943e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:56,603] [INFO] [timer.py:215:stop] epoch=0/micro_step=2450/global_step=2450, RunningAvgSamplesPerSec=127.92778142730064, CurrSamplesPerSec=127.34093546223303, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:42:59,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=22, lr=[1.2784468962576136e-05, 1.2784468962576136e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:42:59,113] [INFO] [timer.py:215:stop] epoch=0/micro_step=2460/global_step=2460, RunningAvgSamplesPerSec=127.92687485068114, CurrSamplesPerSec=127.9044902650094, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:01,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=22, lr=[1.2598705274773297e-05, 1.2598705274773297e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:01,623] [INFO] [timer.py:215:stop] epoch=0/micro_step=2470/global_step=2470, RunningAvgSamplesPerSec=127.92595681046971, CurrSamplesPerSec=127.39919564963584, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:04,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=22, lr=[1.2413845378984126e-05, 1.2413845378984126e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:04,138] [INFO] [timer.py:215:stop] epoch=0/micro_step=2480/global_step=2480, RunningAvgSamplesPerSec=127.92415049034301, CurrSamplesPerSec=127.15201753746084, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:06,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=22, lr=[1.2229902747583971e-05, 1.2229902747583971e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:06,651] [INFO] [timer.py:215:stop] epoch=0/micro_step=2490/global_step=2490, RunningAvgSamplesPerSec=127.92266844770472, CurrSamplesPerSec=127.43342720258859, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:08,138] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:43:08,139] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:43:09,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=22, lr=[1.204689078609902e-05, 1.204689078609902e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:09,164] [INFO] [timer.py:215:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=127.92108580463508, CurrSamplesPerSec=127.89339939968555, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:11,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=22, lr=[1.1864822832229319e-05, 1.1864822832229319e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:11,681] [INFO] [timer.py:215:stop] epoch=0/micro_step=2510/global_step=2510, RunningAvgSamplesPerSec=127.91892932977089, CurrSamplesPerSec=127.50158215973592, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:14,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=22, lr=[1.1683712154876714e-05, 1.1683712154876714e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:14,192] [INFO] [timer.py:215:stop] epoch=0/micro_step=2520/global_step=2520, RunningAvgSamplesPerSec=127.91787676305164, CurrSamplesPerSec=127.62014914942232, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:16,693] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=22, lr=[1.1503571953177883e-05, 1.1503571953177883e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:16,702] [INFO] [timer.py:215:stop] epoch=0/micro_step=2530/global_step=2530, RunningAvgSamplesPerSec=127.91699322521518, CurrSamplesPerSec=127.72909021697754, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:19,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=22, lr=[1.1324415355542328e-05, 1.1324415355542328e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:19,213] [INFO] [timer.py:215:stop] epoch=0/micro_step=2540/global_step=2540, RunningAvgSamplesPerSec=127.91597448134003, CurrSamplesPerSec=127.65874912852549, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:21,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=22, lr=[1.1146255418695634e-05, 1.1146255418695634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:21,725] [INFO] [timer.py:215:stop] epoch=0/micro_step=2550/global_step=2550, RunningAvgSamplesPerSec=127.91492001992398, CurrSamplesPerSec=127.39363324781434, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:24,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=22, lr=[1.0969105126727903e-05, 1.0969105126727903e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:24,235] [INFO] [timer.py:215:stop] epoch=0/micro_step=2560/global_step=2560, RunningAvgSamplesPerSec=127.91403594874994, CurrSamplesPerSec=127.22758976322869, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:26,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=22, lr=[1.0792977390147474e-05, 1.0792977390147474e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:26,745] [INFO] [timer.py:215:stop] epoch=0/micro_step=2570/global_step=2570, RunningAvgSamplesPerSec=127.91318236788061, CurrSamplesPerSec=127.6330132152139, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:29,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=22, lr=[1.0617885044940063e-05, 1.0617885044940063e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:29,262] [INFO] [timer.py:215:stop] epoch=0/micro_step=2580/global_step=2580, RunningAvgSamplesPerSec=127.91095123385954, CurrSamplesPerSec=126.65491001328662, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:31,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=22, lr=[1.0443840851633227e-05, 1.0443840851633227e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:31,771] [INFO] [timer.py:215:stop] epoch=0/micro_step=2590/global_step=2590, RunningAvgSamplesPerSec=127.91048525118175, CurrSamplesPerSec=127.18310336952804, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:33,257] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:43:33,258] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:43:33,503] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2596
[2023-06-12 07:43:33,503] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:43:33,504] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
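Meanwhile the lr column falls smoothly, from about 2.05e-05 at step 2070 to about 1.04e-05 by step 2590, consistent with the tail of a cosine decay schedule. A minimal sketch of warmup-free cosine decay (both base_lr and total_steps below are assumptions for illustration; the true values are not visible in this excerpt):

import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    # Warmup-free cosine decay from base_lr down to 0. base_lr and
    # total_steps are assumed placeholders, not values read from this log.
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))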
[2023-06-12 07:43:34,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=23, lr=[1.0288107732566627e-05, 1.0288107732566627e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:34,258] [INFO] [timer.py:215:stop] epoch=0/micro_step=2600/global_step=2600, RunningAvgSamplesPerSec=127.91421499051327, CurrSamplesPerSec=127.5827852529218, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:36,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=23, lr=[1.0116089908795365e-05, 1.0116089908795365e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:36,770] [INFO] [timer.py:215:stop] epoch=0/micro_step=2610/global_step=2610, RunningAvgSamplesPerSec=127.91300019857147, CurrSamplesPerSec=127.48692814182425, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:39,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=23, lr=[9.945156807173722e-06, 9.945156807173722e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:39,283] [INFO] [timer.py:215:stop] epoch=0/micro_step=2620/global_step=2620, RunningAvgSamplesPerSec=127.91168849945691, CurrSamplesPerSec=127.4659823793483, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=23, lr=[9.775320885108399e-06, 9.775320885108399e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:41,791] [INFO] [timer.py:215:stop] epoch=0/micro_step=2630/global_step=2630, RunningAvgSamplesPerSec=127.91115165418711, CurrSamplesPerSec=127.85417023887092, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:44,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=23, lr=[9.606594520044945e-06, 9.606594520044945e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:44,301] [INFO] [timer.py:215:stop] epoch=0/micro_step=2640/global_step=2640, RunningAvgSamplesPerSec=127.91042212495985, CurrSamplesPerSec=127.6322849913037, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:46,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=23, lr=[9.438990008565656e-06, 9.438990008565656e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:46,813] [INFO] [timer.py:215:stop] epoch=0/micro_step=2650/global_step=2650, RunningAvgSamplesPerSec=127.90919777198359, CurrSamplesPerSec=127.99231003233716, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:49,314] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=23, lr=[9.272519565493443e-06, 9.272519565493443e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:49,323] [INFO] [timer.py:215:stop] epoch=0/micro_step=2660/global_step=2660, RunningAvgSamplesPerSec=127.90847408833218, CurrSamplesPerSec=127.04754566528814, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:51,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=23, lr=[9.10719532300162e-06, 9.10719532300162e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:51,830] [INFO] [timer.py:215:stop] epoch=0/micro_step=2670/global_step=2670, RunningAvgSamplesPerSec=127.90840752849742, CurrSamplesPerSec=127.91594878319586, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:54,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=23, lr=[8.943029329729721e-06, 8.943029329729721e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:54,337] [INFO] [timer.py:215:stop] epoch=0/micro_step=2680/global_step=2680, RunningAvgSamplesPerSec=127.90816067102958, CurrSamplesPerSec=127.42653102881908, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:55,306] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2683
[2023-06-12 07:43:55,307] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:43:55,308] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:43:56,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=24, lr=[8.796280129060475e-06, 8.796280129060475e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:56,813] [INFO] [timer.py:215:stop] epoch=0/micro_step=2690/global_step=2690, RunningAvgSamplesPerSec=127.91388800137813, CurrSamplesPerSec=128.18642835926806, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:43:59,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=24, lr=[8.634347700284575e-06, 8.634347700284575e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:43:59,324] [INFO] [timer.py:215:stop] epoch=0/micro_step=2700/global_step=2700, RunningAvgSamplesPerSec=127.91300525188942, CurrSamplesPerSec=128.09027358262784, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:01,826] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=24, lr=[8.473607981316364e-06, 8.473607981316364e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:01,836] [INFO] [timer.py:215:stop] epoch=0/micro_step=2710/global_step=2710, RunningAvgSamplesPerSec=127.91189445714673, CurrSamplesPerSec=127.32776782732765, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:04,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=24, lr=[8.31407268668061e-06, 8.31407268668061e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:04,349] [INFO] [timer.py:215:stop] epoch=0/micro_step=2720/global_step=2720, RunningAvgSamplesPerSec=127.91060496861292, CurrSamplesPerSec=127.63944622067866, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:06,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=24, lr=[8.155753443125036e-06, 8.155753443125036e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:06,858] [INFO] [timer.py:215:stop] epoch=0/micro_step=2730/global_step=2730, RunningAvgSamplesPerSec=127.90996461141324, CurrSamplesPerSec=127.21649539256555, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:09,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=24, lr=[7.998661788772957e-06, 7.998661788772957e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:09,365] [INFO] [timer.py:215:stop] epoch=0/micro_step=2740/global_step=2740, RunningAvgSamplesPerSec=127.90994942487251, CurrSamplesPerSec=127.56568765171373, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:11,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=24, lr=[7.842809172282436e-06, 7.842809172282436e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:11,880] [INFO] [timer.py:215:stop] epoch=0/micro_step=2750/global_step=2750, RunningAvgSamplesPerSec=127.90815176213754, CurrSamplesPerSec=128.55439278045006, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:14,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=24, lr=[7.688206952011861e-06, 7.688206952011861e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:14,396] [INFO] [timer.py:215:stop] epoch=0/micro_step=2760/global_step=2760, RunningAvgSamplesPerSec=127.90637315179274, CurrSamplesPerSec=126.99273057915985, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:16,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=24, lr=[7.534866395192203e-06, 7.534866395192203e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:16,909] [INFO] [timer.py:215:stop] epoch=0/micro_step=2770/global_step=2770, RunningAvgSamplesPerSec=127.90505624625827, CurrSamplesPerSec=127.14382690261898, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:19,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=24, lr=[7.382798677105856e-06, 7.382798677105856e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:19,418] [INFO] [timer.py:215:stop] epoch=0/micro_step=2780/global_step=2780, RunningAvgSamplesPerSec=127.90455678143223, CurrSamplesPerSec=126.86201466565593, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:20,652] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:44:20,653] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:44:21,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=24, lr=[7.2320148802721925e-06, 7.2320148802721925e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:21,928] [INFO] [timer.py:215:stop] epoch=0/micro_step=2790/global_step=2790, RunningAvgSamplesPerSec=127.90391261018638, CurrSamplesPerSec=128.1161942488963, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:24,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=24, lr=[7.082525993639916e-06, 7.082525993639916e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:24,438] [INFO] [timer.py:215:stop] epoch=0/micro_step=2800/global_step=2800, RunningAvgSamplesPerSec=127.90317647345486, CurrSamplesPerSec=127.25002559833743, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:26,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=24, lr=[6.934342911786143e-06, 6.934342911786143e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:26,948] [INFO] [timer.py:215:stop] epoch=0/micro_step=2810/global_step=2810, RunningAvgSamplesPerSec=127.90253904163518, CurrSamplesPerSec=127.71705749077694, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:29,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=24, lr=[6.787476434122461e-06, 6.787476434122461e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:29,454] [INFO] [timer.py:215:stop] epoch=0/micro_step=2820/global_step=2820, RunningAvgSamplesPerSec=127.90251568918812, CurrSamplesPerSec=127.68473836172247, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:31,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=24, lr=[6.641937264107867e-06, 6.641937264107867e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:31,964] [INFO] [timer.py:215:stop] epoch=0/micro_step=2830/global_step=2830, RunningAvgSamplesPerSec=127.90180511721627, CurrSamplesPerSec=127.52799211749269, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:34,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=24, lr=[6.497736008468702e-06, 6.497736008468702e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:34,475] [INFO] [timer.py:215:stop] epoch=0/micro_step=2840/global_step=2840, RunningAvgSamplesPerSec=127.9009379536914, CurrSamplesPerSec=127.64442314148305, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:36,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=24, lr=[6.35488317642568e-06, 6.35488317642568e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:36,989] [INFO] [timer.py:215:stop] epoch=0/micro_step=2850/global_step=2850, RunningAvgSamplesPerSec=127.89952911940284, CurrSamplesPerSec=127.7446510423805, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
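Each timer.py line pairs an instantaneous throughput with a running average (CurrSamplesPerSec vs RunningAvgSamplesPerSec), both hovering around 128 samples/sec here. A small sketch of how such a pair can be maintained (hypothetical names; not DeepSpeed's actual timer implementation):

import time

class ThroughputMeter:
    # Tracks instantaneous and running-average samples/sec, in the spirit
    # of the CurrSamplesPerSec / RunningAvgSamplesPerSec columns above.
    def __init__(self):
        self.total_samples = 0
        self.total_time = 0.0
        self._t0 = None

    def start(self):
        self._t0 = time.time()

    def stop(self, batch_samples: int):
        # Call start() before each micro-batch, stop() after it.
        elapsed = time.time() - self._t0
        self.total_samples += batch_samples
        self.total_time += elapsed
        curr = batch_samples / elapsed          # CurrSamplesPerSec
        running_avg = self.total_samples / self.total_time  # RunningAvg
        return curr, running_avg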
[2023-06-12 07:44:39,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=24, lr=[6.2133891789279365e-06, 6.2133891789279365e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:39,500] [INFO] [timer.py:215:stop] epoch=0/micro_step=2860/global_step=2860, RunningAvgSamplesPerSec=127.89876795912716, CurrSamplesPerSec=127.9049778197083, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:42,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=24, lr=[6.073264327894332e-06, 6.073264327894332e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:42,009] [INFO] [timer.py:215:stop] epoch=0/micro_step=2870/global_step=2870, RunningAvgSamplesPerSec=127.8981732997756, CurrSamplesPerSec=127.55526157729474, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:44,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=24, lr=[5.934518835461908e-06, 5.934518835461908e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:44,518] [INFO] [timer.py:215:stop] epoch=0/micro_step=2880/global_step=2880, RunningAvgSamplesPerSec=127.89770668617244, CurrSamplesPerSec=127.76362092533574, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:45,751] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:44:45,752] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:44:45,997] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2885
[2023-06-12 07:44:45,997] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:44:45,998] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:44:46,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=25, lr=[5.810835603212231e-06, 5.810835603212231e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:47,005] [INFO] [timer.py:215:stop] epoch=0/micro_step=2890/global_step=2890, RunningAvgSamplesPerSec=127.90106613363557, CurrSamplesPerSec=127.22771036484505, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:49,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=25, lr=[5.674738665931575e-06, 5.674738665931575e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:49,514] [INFO] [timer.py:215:stop] epoch=0/micro_step=2900/global_step=2900, RunningAvgSamplesPerSec=127.90064842028545, CurrSamplesPerSec=127.92704355998147, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:52,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=25, lr=[5.5400501313413316e-06, 5.5400501313413316e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:52,019] [INFO] [timer.py:215:stop] epoch=0/micro_step=2910/global_step=2910, RunningAvgSamplesPerSec=127.90076134057966, CurrSamplesPerSec=127.7572970050487, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:54,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=25, lr=[5.406779815386087e-06, 5.406779815386087e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:54,531] [INFO] [timer.py:215:stop] epoch=0/micro_step=2920/global_step=2920, RunningAvgSamplesPerSec=127.89977994788383, CurrSamplesPerSec=127.99975585984066, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:57,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=25, lr=[5.274937430652302e-06, 5.274937430652302e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:57,042] [INFO] [timer.py:215:stop] epoch=0/micro_step=2930/global_step=2930, RunningAvgSamplesPerSec=127.89897023283858, CurrSamplesPerSec=127.83821743532498, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:44:59,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=25, lr=[5.144532585660452e-06, 5.144532585660452e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:44:59,555] [INFO] [timer.py:215:stop] epoch=0/micro_step=2940/global_step=2940, RunningAvgSamplesPerSec=127.89780899940031, CurrSamplesPerSec=126.95369167002454, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:01,274] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946
[2023-06-12 07:45:01,275] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:45:01,275] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:45:02,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=26, lr=[5.02840517867596e-06, 5.02840517867596e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:02,030] [INFO] [timer.py:215:stop] epoch=0/micro_step=2950/global_step=2950, RunningAvgSamplesPerSec=127.90318812518296, CurrSamplesPerSec=127.64976465900045, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:04,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=26, lr=[4.90075775488921e-06, 4.90075775488921e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:04,538] [INFO] [timer.py:215:stop] epoch=0/micro_step=2960/global_step=2960, RunningAvgSamplesPerSec=127.9028364718858, CurrSamplesPerSec=128.2019783671021, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:07,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=26, lr=[4.7745751406263165e-06, 4.7745751406263165e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:07,054] [INFO] [timer.py:215:stop] epoch=0/micro_step=2970/global_step=2970, RunningAvgSamplesPerSec=127.9012007231677, CurrSamplesPerSec=127.74659641214632, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:09,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=26, lr=[4.649866531930241e-06, 4.649866531930241e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:09,566] [INFO] [timer.py:215:stop] epoch=0/micro_step=2980/global_step=2980, RunningAvgSamplesPerSec=127.90017258845717, CurrSamplesPerSec=127.8654979284143, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:12,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=26, lr=[4.526641017420119e-06, 4.526641017420119e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:12,079] [INFO] [timer.py:215:stop] epoch=0/micro_step=2990/global_step=2990, RunningAvgSamplesPerSec=127.89906426705373, CurrSamplesPerSec=127.60170708913542, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:14,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=26, lr=[4.404907577628895e-06, 4.404907577628895e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:14,591] [INFO] [timer.py:215:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=127.89805185865083, CurrSamplesPerSec=127.40766110621188, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:17,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=26, lr=[4.284675084348852e-06, 4.284675084348852e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:17,102] [INFO] [timer.py:215:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=127.89732795081981, CurrSamplesPerSec=128.02539566259176, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:19,600] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=26, lr=[4.165952299985004e-06, 4.165952299985004e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:19,609] [INFO] [timer.py:215:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=127.8971436711895, CurrSamplesPerSec=127.59394360560236, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:22,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=26, lr=[4.048747876916539e-06, 4.048747876916539e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:22,121] [INFO] [timer.py:215:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=127.89623754072561, CurrSamplesPerSec=127.2800730578365, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:24,620] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=26, lr=[3.933070356866231e-06, 3.933070356866231e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:24,629] [INFO] [timer.py:215:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=127.89594245262232, CurrSamplesPerSec=127.61881434770868, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:26,615] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:45:26,616] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:45:27,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=26, lr=[3.818928170277911e-06, 3.818928170277911e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:27,140] [INFO] [timer.py:215:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=127.89514046724503, CurrSamplesPerSec=127.29987110456464, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:29,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=26, lr=[3.7063296357021133e-06, 3.7063296357021133e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:29,651] [INFO] [timer.py:215:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=127.89437980349093, CurrSamplesPerSec=127.41794210179557, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:32,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=26, lr=[3.5952829591897746e-06, 3.5952829591897746e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:32,163] [INFO] [timer.py:215:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=127.89335618217243, CurrSamplesPerSec=127.33706945215292, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:34,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=26, lr=[3.4857962336942218e-06, 3.4857962336942218e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:34,676] [INFO] [timer.py:215:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=127.89233934706733, CurrSamplesPerSec=127.87487828719838, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:35,395] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3082
[2023-06-12 07:45:35,396] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:45:35,396] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:45:37,145] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=27, lr=[3.3885985360893046e-06, 3.3885985360893046e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:37,154] [INFO] [timer.py:215:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=127.8970061037381, CurrSamplesPerSec=127.78563791552219, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=27, lr=[3.28209760597179e-06, 3.28209760597179e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:39,664] [INFO] [timer.py:215:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=127.89632706939746, CurrSamplesPerSec=127.49007665491987, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:42,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=27, lr=[3.17717945145731e-06, 3.17717945145731e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:42,174] [INFO] [timer.py:215:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=127.89570625009448, CurrSamplesPerSec=127.44613263471945, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:44,676] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=27, lr=[3.073851718859594e-06, 3.073851718859594e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:44,685] [INFO] [timer.py:215:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=127.89500311699142, CurrSamplesPerSec=127.75146009579142, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:47,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=27, lr=[2.972121938584263e-06, 2.972121938584263e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:47,195] [INFO] [timer.py:215:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=127.89441509655757, CurrSamplesPerSec=128.1398009608261, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:49,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=27, lr=[2.8719975245800224e-06, 2.8719975245800224e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:49,703] [INFO] [timer.py:215:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=127.89407312852991, CurrSamplesPerSec=127.66919212816978, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:52,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=27, lr=[2.7734857737983317e-06, 2.7734857737983317e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:52,211] [INFO] [timer.py:215:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=127.89389143794875, CurrSamplesPerSec=127.82433808214533, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:54,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=27, lr=[2.676593865661639e-06, 2.676593865661639e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:54,719] [INFO] [timer.py:215:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=127.89365586902977, CurrSamplesPerSec=127.98401078661628, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:57,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=27, lr=[2.5813288615401247e-06, 2.5813288615401247e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:57,230] [INFO] [timer.py:215:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=127.89304126228993, CurrSamplesPerSec=127.8362692741059, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:45:59,729] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=27, lr=[2.4876977042370795e-06, 2.4876977042370795e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:45:59,738] [INFO] [timer.py:215:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=127.89279792830571, CurrSamplesPerSec=127.46101936353881, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:00,723] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:46:00,724] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:46:02,240] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=27, lr=[2.395707217482937e-06, 2.395707217482937e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:02,249] [INFO] [timer.py:215:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=127.89210096278747, CurrSamplesPerSec=127.98413282655005, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:04,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=27, lr=[2.3053641054379572e-06, 2.3053641054379572e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:04,762] [INFO] [timer.py:215:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=127.89103656350647, CurrSamplesPerSec=127.43681507011897, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:07,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=27, lr=[2.216674952203629e-06, 2.216674952203629e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:07,270] [INFO] [timer.py:215:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=127.89073334712918, CurrSamplesPerSec=128.22230926345918, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:09,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=27, lr=[2.129646221342854e-06, 2.129646221342854e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:09,778] [INFO] [timer.py:215:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=127.89059273156528, CurrSamplesPerSec=127.25195593226766, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:12,253] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3229
[2023-06-12 07:46:12,254] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-12 07:46:12,254] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-12 07:46:12,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=28, lr=[2.0527452693256287e-06, 2.0527452693256287e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:12,255] [INFO] [timer.py:215:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=127.89517223528777, CurrSamplesPerSec=147.90973188014502, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:14,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=28, lr=[1.9688887143216263e-06, 1.9688887143216263e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:14,768] [INFO] [timer.py:215:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=127.89420046053199, CurrSamplesPerSec=127.12046530173863, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:17,269] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=28, lr=[1.8867106400655533e-06, 1.8867106400655533e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:17,279] [INFO] [timer.py:215:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=127.89348565971912, CurrSamplesPerSec=127.79013956039144, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:19,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=28, lr=[1.8062170356003855e-06, 1.8062170356003855e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:19,786] [INFO] [timer.py:215:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=127.89329833014285, CurrSamplesPerSec=128.12438833935687, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:22,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=28, lr=[1.7274137672069145e-06, 1.7274137672069145e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:22,294] [INFO] [timer.py:215:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=127.89307414818478, CurrSamplesPerSec=127.7356544778662, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:24,789] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=28, lr=[1.6503065779761796e-06, 1.6503065779761796e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:24,798] [INFO] [timer.py:215:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=127.89344929159383, CurrSamplesPerSec=128.01758053931968, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:27,299] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=28, lr=[1.5749010873909175e-06, 1.5749010873909175e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:27,308] [INFO] [timer.py:215:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=127.89287386862631, CurrSamplesPerSec=127.62330424628995, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:29,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=28, lr=[1.5012027909160675e-06, 1.5012027909160675e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:29,815] [INFO] [timer.py:215:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=127.89283485240924, CurrSamplesPerSec=128.6913359598751, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:32,314] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=28, lr=[1.4292170595982146e-06, 1.4292170595982146e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:32,323] [INFO] [timer.py:215:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=127.89255370379047, CurrSamplesPerSec=127.81606060461772, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
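
The learning-rate values in these entries follow the cosine decay requested by --lr_scheduler_type cosine with --num_warmup_steps 0 and a 5e-5 peak. A small sketch that reproduces them; the total step count (~3711) is an assumption inferred from fitting the logged values, not a number the run prints.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5, warmup=0):
    """Cosine decay from base_lr toward 0, matching --lr_scheduler_type cosine
    with --num_warmup_steps 0 and --learning_rate 5e-5."""
    if step < warmup:
        return base_lr * step / max(1, warmup)  # linear warmup (unused here)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# total_steps ~3711 is inferred from the log, not printed by it:
print(f"{cosine_lr(2940, 3711):.4e}")  # ~5.14e-06 vs. 5.144532585660452e-06 logged
print(f"{cosine_lr(3680, 3711):.4e}")  # ~8.6e-09 vs. 8.754113263159668e-09 logged
```
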
[2023-06-12 07:46:34,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=28, lr=[1.3589491396741898e-06, 1.3589491396741898e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:34,832] [INFO] [timer.py:215:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=127.89212671315367, CurrSamplesPerSec=127.76362092533574, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:37,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=28, lr=[1.2904041521887122e-06, 1.2904041521887122e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:37,337] [INFO] [timer.py:215:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=127.8922711871296, CurrSamplesPerSec=127.79756188365103, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:37,568] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:46:37,568] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-12 07:46:39,838] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=28, lr=[1.2235870926211619e-06, 1.2235870926211619e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:39,848] [INFO] [timer.py:215:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=127.8916840834043, CurrSamplesPerSec=127.74671799972398, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:42,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=28, lr=[1.15850283052156e-06, 1.15850283052156e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:42,354] [INFO] [timer.py:215:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=127.8917288816514, CurrSamplesPerSec=127.73091355510404, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:44,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=28, lr=[1.095156109155629e-06, 1.095156109155629e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:44,863] [INFO] [timer.py:215:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=127.89127543487183, CurrSamplesPerSec=127.51284745511514, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:47,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=28, lr=[1.0335515451591503e-06, 1.0335515451591503e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:47,370] [INFO] [timer.py:215:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=127.89112915918383, CurrSamplesPerSec=127.64648685662114, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:49,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=28, lr=[9.73693628201483e-07, 9.73693628201483e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:49,883] [INFO] [timer.py:215:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=127.89012819304345, CurrSamplesPerSec=126.88264244051389, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:52,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=28, lr=[9.155867206583624e-07, 9.155867206583624e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:52,393] [INFO] [timer.py:215:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=127.88973539918096, CurrSamplesPerSec=127.7621615035487, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:54,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=28, lr=[8.59235057294e-07, 8.59235057294e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:54,900] [INFO] [timer.py:215:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=127.88954194619005, CurrSamplesPerSec=127.67732913759744, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:57,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=28, lr=[8.046427449524274e-07, 8.046427449524274e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:57,406] [INFO] [timer.py:215:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=127.88964287169027, CurrSamplesPerSec=127.38710409150919, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:46:59,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=28, lr=[7.518137622582188e-07, 7.518137622582188e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:46:59,917] [INFO] [timer.py:215:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=127.88907576059347, CurrSamplesPerSec=127.08543899284363, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:02,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=28, lr=[7.007519593265204e-07, 7.007519593265204e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:02,422] [INFO] [timer.py:215:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=127.8892374973014, CurrSamplesPerSec=127.85307411736352, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:02,651] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:47:02,651] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:47:03,397] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3433
[2023-06-12 07:47:03,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:47:03,398] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:47:04,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=29, lr=[6.563103537256809e-07, 6.563103537256809e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:04,904] [INFO] [timer.py:215:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=127.89283487192752, CurrSamplesPerSec=127.51842027256075, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:07,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=29, lr=[6.086163379298321e-07, 6.086163379298321e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:07,415] [INFO] [timer.py:215:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=127.89217632262086, CurrSamplesPerSec=127.36727890414079, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:09,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=29, lr=[5.626999379591269e-07, 5.626999379591269e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:09,924] [INFO] [timer.py:215:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=127.89185852929202, CurrSamplesPerSec=127.6304644678836, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:12,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=29, lr=[5.185645001476724e-07, 5.185645001476724e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:12,432] [INFO] [timer.py:215:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=127.89154028618431, CurrSamplesPerSec=127.31013471085825, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:14,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=29, lr=[4.762132410351311e-07, 4.762132410351311e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:14,946] [INFO] [timer.py:215:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=127.89044613605812, CurrSamplesPerSec=127.755716122793, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:17,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=29, lr=[4.356492471322665e-07, 4.356492471322665e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:17,458] [INFO] [timer.py:215:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=127.88974268846509, CurrSamplesPerSec=128.08159494917493, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:19,959] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=29, lr=[3.968754746960346e-07, 3.968754746960346e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:19,968] [INFO] [timer.py:215:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=127.88915659442345, CurrSamplesPerSec=128.42363869308411, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:22,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=29, lr=[3.598947495141114e-07, 3.598947495141114e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:22,474] [INFO] [timer.py:215:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=127.88916062958893, CurrSamplesPerSec=127.9258242596151, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
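
Each timer line carries two throughput figures: CurrSamplesPerSec for the step just finished and RunningAvgSamplesPerSec over the whole run. With the global batch of 32 (8 GPUs x 4 per device x 1 accumulation step), ~128 samples/s works out to ~0.25 s per step, matching the ~2.5 s between each pair of 10-step log lines; the 147.9 blip at step 3230 is the overflow-skipped step finishing early. A rough sketch of how such a meter could be kept; `ThroughputMeter` is illustrative, not DeepSpeed's actual timer class.

```python
import time

class ThroughputMeter:
    """Illustrative stand-in for the timer-line figures:
    CurrSamplesPerSec = one step's batch / its wall time,
    RunningAvgSamplesPerSec = average over all timed steps."""

    def __init__(self, global_batch_size=32):  # 8 GPUs x 4 per device x 1 accum
        self.batch = global_batch_size
        self.total_time = 0.0
        self.steps = 0
        self._t0 = None

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        dt = time.perf_counter() - self._t0
        self.total_time += dt
        self.steps += 1
        curr = self.batch / dt
        running_avg = self.batch * self.steps / self.total_time
        return curr, running_avg
```
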
[2023-06-12 07:47:24,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=29, lr=[3.2470976669896905e-07, 3.2470976669896905e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:24,984] [INFO] [timer.py:215:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=127.88868856907614, CurrSamplesPerSec=127.56095933699872, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:27,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=29, lr=[2.9132309049146046e-07, 2.9132309049146046e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:27,493] [INFO] [timer.py:215:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=127.88836425635921, CurrSamplesPerSec=127.95863153830386, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:28,726] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:47:28,727] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:47:28,971] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3535
[2023-06-12 07:47:28,972] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:47:28,973] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:47:29,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=30, lr=[2.628146477903104e-07, 2.628146477903104e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:29,978] [INFO] [timer.py:215:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=127.89143295417861, CurrSamplesPerSec=126.87004910602458, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:32,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=30, lr=[2.3285134909173112e-07, 2.3285134909173112e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:32,484] [INFO] [timer.py:215:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=127.89148987970766, CurrSamplesPerSec=127.40911244030266, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:34,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=30, lr=[2.0469305153599516e-07, 2.0469305153599516e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:34,993] [INFO] [timer.py:215:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=127.89108704975476, CurrSamplesPerSec=127.78235313158702, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:37,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=30, lr=[1.7834180726725158e-07, 1.7834180726725158e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:37,500] [INFO] [timer.py:215:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=127.89100558307919, CurrSamplesPerSec=127.6951856140214, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:40,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=30, lr=[1.5379953673370084e-07, 1.5379953673370084e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:40,009] [INFO] [timer.py:215:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=127.89061494415745, CurrSamplesPerSec=127.8687869884495, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:42,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=30, lr=[1.31068028547629e-07, 1.31068028547629e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:42,516] [INFO] [timer.py:215:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=127.89059047290596, CurrSamplesPerSec=127.67975829666116, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:45,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=30, lr=[1.1014893935505367e-07, 1.1014893935505367e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:45,025] [INFO] [timer.py:215:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=127.89029098424352, CurrSamplesPerSec=127.42290176346332, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:47,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=30, lr=[9.104379371500105e-08, 9.104379371500105e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:47,528] [INFO] [timer.py:215:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=127.89068973868224, CurrSamplesPerSec=127.90412460142448, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:50,024] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=30, lr=[7.375398398837829e-08, 7.375398398837829e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:50,033] [INFO] [timer.py:215:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=127.89086018223459, CurrSamplesPerSec=127.89778676480398, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:52,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=30, lr=[5.8280770236518456e-08, 5.8280770236518456e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:52,539] [INFO] [timer.py:215:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=127.89091702769119, CurrSamplesPerSec=128.18116425077406, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:54,272] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-12 07:47:54,273] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-12 07:47:54,519] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3637
[2023-06-12 07:47:54,519] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-12 07:47:54,519] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-12 07:47:55,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=31, lr=[4.5909002174351904e-08, 4.5909002174351904e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:55,022] [INFO] [timer.py:215:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=127.89422976269, CurrSamplesPerSec=127.92289803356437, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:47:57,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=31, lr=[3.389031801728504e-08, 3.389031801728504e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:47:57,529] [INFO] [timer.py:215:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=127.89414981528735, CurrSamplesPerSec=127.15225845446096, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:48:00,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=31, lr=[2.369113505284737e-08, 2.369113505284737e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:48:00,038] [INFO] [timer.py:215:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=127.89373988213451, CurrSamplesPerSec=127.45230475879372, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:48:02,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=31, lr=[1.5312196585692828e-08, 1.5312196585692828e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:48:02,551] [INFO] [timer.py:215:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=127.89283392302097, CurrSamplesPerSec=127.42024043358931, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
[2023-06-12 07:48:05,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=31, lr=[8.754113263159668e-09, 8.754113263159668e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-12 07:48:05,057] [INFO] [timer.py:215:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=127.89297541765943, CurrSamplesPerSec=128.34933949940807, MemAllocated=4.32GB, MaxMemAllocated=12.79GB
Epoch 1/1 with loss 0.5835847191188647
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : 2.8516085147857666, acc (higher is better) : 0.6974999904632568
saving model ...
[2023-06-12 07:48:24,156] [INFO] [launch.py:346:main] Process 4245 exits successfully.
[2023-06-12 07:48:24,157] [INFO] [launch.py:346:main] Process 4249 exits successfully.
[2023-06-12 07:48:24,157] [INFO] [launch.py:346:main] Process 4246 exits successfully.
[2023-06-12 07:48:24,157] [INFO] [launch.py:346:main] Process 4244 exits successfully.
[2023-06-12 07:48:24,157] [INFO] [launch.py:346:main] Process 4250 exits successfully.
[2023-06-12 07:48:25,158] [INFO] [launch.py:346:main] Process 4247 exits successfully.
[2023-06-12 07:48:25,160] [INFO] [launch.py:346:main] Process 4243 exits successfully.
[2023-06-12 07:48:25,160] [INFO] [launch.py:346:main] Process 4248 exits successfully.
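
The final evaluation line summarizes reward-model quality: chosen_last_scores is the mean reward assigned to the preferred responses, and acc is the fraction of preference pairs where the chosen response outscores the rejected one (here ~0.697). A hedged sketch of how those two numbers can be computed from score tensors; the tensor names are illustrative and this is not the training script's own eval code.

```python
import torch

def reward_eval_metrics(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor):
    """Sketch of the two figures in the eval line: mean reward of the
    chosen responses, and the fraction of pairs ranked correctly."""
    mean_chosen = chosen_scores.mean().item()
    acc = (chosen_scores > rejected_scores).float().mean().item()
    return mean_chosen, acc

# An acc of ~0.697 means roughly 70% of held-out preference pairs
# have the chosen response scored above the rejected one.
```
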