[2023-06-30 00:28:45,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-30 00:28:46,132] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-30 00:28:46,192] [INFO] [runner.py:555:main] cmd = /home/mxfeng/miniconda3/envs/safe-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir ./output
[2023-06-30 00:28:47,453] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-30 00:28:48,506] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-30 00:28:48,506] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-30 00:28:48,506] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-30 00:28:48,506] [INFO] [launch.py:163:main] dist_world_size=8
[2023-06-30 00:28:48,506] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-06-30 00:28:50,082] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[...]
[2023-06-30 00:28:52,680] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-30 00:28:52,680] [INFO] [comm.py:594:init_distributed] cdb=None
[...]
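The --world_info argument in the launcher cmd above is simply the WORLD INFO DICT, base64-encoded so it survives being passed on the command line. A minimal check, using the exact blob from the cmd line:

    import base64, json

    world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
    print(json.loads(base64.urlsafe_b64decode(world_info)))
    # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} -- matches the WORLD INFO DICT logged by launch.py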
[2023-06-30 00:28:52,808] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.07954144477844238 seconds
[...]
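The "Installed CUDA version 11.1 does not match ... 11.7" line is informational rather than an error: DeepSpeed accepts the pairing when the toolkit's major version matches the one PyTorch was built against, since CUDA 11.x releases are API-compatible. Roughly (a sketch of the idea, not DeepSpeed's exact code; the "11.1" literal stands in for what nvcc under CUDA_HOME reports):

    import torch

    torch_cuda = torch.version.cuda     # CUDA version PyTorch was built with, '11.7' here
    installed_cuda = "11.1"             # version reported by the local nvcc
    # same major version => APIs compatible, so the combination is accepted
    print(torch_cuda.split(".")[0] == installed_cuda.split(".")[0])   # True for 11.7 vs 11.1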
[2023-06-30 00:30:42,554] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
[2023-06-30 00:30:42,554] [INFO] [comm.py:619:init_distributed] Distributed backend already initialized
[...]
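The fused_adam block repeats once per process because every rank JIT-loads the extension; here ninja finds a cached build ("ninja: no work to do") and each load takes ~0.07 s. To pay the compile cost once at install time instead, DeepSpeed can be installed with DS_BUILD_FUSED_ADAM=1, or the op can be loaded by hand; a minimal sketch, assuming a working CUDA toolchain:

    # Triggers the same JIT path seen in the log; later loads hit the
    # ~/.cache/torch_extensions cache.
    from deepspeed.ops.op_builder import FusedAdamBuilder

    fused_adam_module = FusedAdamBuilder().load()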
[2023-06-30 00:30:45,151] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-06-30 00:30:45,152] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-06-30 00:30:45,152] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-06-30 00:30:45,170] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-06-30 00:30:45,171] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-06-30 00:30:45,345] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-06-30 00:30:45,345] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-06-30 00:30:45,345] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2023-06-30 00:30:45,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:30:45,346] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2023-06-30 00:30:45,346] [INFO] [config.py:964:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-06-30 00:30:45,346] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-06-30 00:30:45,346] [INFO] [config.py:964:print] amp_enabled .................. False
[2023-06-30 00:30:45,346] [INFO] [config.py:964:print] amp_params ................... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] bfloat16_enabled ............. False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] comms_config .................
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] communication_data_type ...... None
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] curriculum_params_legacy ..... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] data_efficiency_enabled ...... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] dataloader_drop_last ......... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] disable_allgather ............ False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] dump_state ................... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_enabled ........... False
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06
[2023-06-30 00:30:45,347] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] eigenvalue_verbose ........... False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] elasticity_enabled ........... False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] fp16_auto_cast ............... False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] fp16_enabled ................. True
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] global_rank .................. 0
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] grad_accum_dtype ............. None
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] gradient_clipping ............ 1.0
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] initial_dynamic_scale ........ 65536
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] load_universal_checkpoint .... False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] loss_scale ................... 0
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] memory_breakdown ............. False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] mics_hierarchial_params_gather False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] mics_shard_size .............. -1
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] optimizer_name ............... None
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] optimizer_params ............. None
[2023-06-30 00:30:45,348] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] pld_enabled .................. False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] pld_params ................... False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] prescale_gradients ........... False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] scheduler_name ............... None
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] scheduler_params ............. None
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] sparse_attention ............. None
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] steps_per_print .............. 10
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] train_batch_size ............. 32
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 4
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] use_node_local_storage ....... False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] wall_clock_breakdown ......... False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] world_size ................... 8
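The batch-size fields above are mutually consistent, which is exactly what DeepSpeed validates at startup: train_batch_size must equal train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size. For this run:

    # 4 samples per GPU per micro step, no gradient accumulation, 8 GPUs
    micro_batch, grad_accum, world_size = 4, 1, 8
    assert micro_batch * grad_accum * world_size == 32   # train_batch_size in the config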
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] zero_allow_untested_optimizer False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] zero_enabled ................. False
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True
[2023-06-30 00:30:45,349] [INFO] [config.py:964:print] zero_optimization_stage ...... 0
[2023-06-30 00:30:45,349] [INFO] [config.py:950:print_user_config] json = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 0,
        "offload_param": { "device": "none" },
        "offload_optimizer": { "device": "none" },
        "stage3_param_persistence_threshold": 1.000000e+04,
        "stage3_max_live_parameters": 3.000000e+07,
        "stage3_prefetch_bucket_size": 3.000000e+07,
        "memory_efficient_linear": false
    },
    "fp16": {
        "enabled": true,
        "loss_scale_window": 100
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false,
    "hybrid_engine": {
        "enabled": false,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": false,
        "pin_parameters": true,
        "tp_gather_partition_size": 8
    }
}
***** Running training *****
***** Evaluating reward, Epoch 0/1 *****
chosen_last_scores (higher is better) : 2.575861692428589, acc (higher is better) : 0.4925000071525574
Beginning of Epoch 1/1, Total Micro Batches 3680
[2023-06-30 00:30:54,697] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 0
[2023-06-30 00:30:54,697] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0
[...]
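The pre-training evaluation above reports the mean reward assigned to the chosen responses and pairwise ranking accuracy; acc ~ 0.4925 before any update is chance level, as expected for an untrained reward head. A sketch of how such a metric is typically computed for a pairwise reward model (function and tensor names here are illustrative, not taken from main.py):

    import torch

    def eval_reward(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor):
        # acc = fraction of pairs where the chosen response out-scores the rejected one
        acc = (chosen_scores > rejected_scores).float().mean().item()
        return chosen_scores.mean().item(), acc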
[2023-06-30 00:30:54,698] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[2023-06-30 00:30:54,954] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1
[2023-06-30 00:30:54,955] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[...]
[2023-06-30 00:30:54,955] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:30:55,211] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2
[2023-06-30 00:30:55,211] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[...]
[2023-06-30 00:30:55,211] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:30:55,470] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3
[2023-06-30 00:30:55,470] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[...]
[2023-06-30 00:30:55,470] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-06-30 00:30:55,727] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4
[2023-06-30 00:30:55,727] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[...]
[2023-06-30 00:30:55,727] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-06-30 00:30:55,985] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5
[2023-06-30 00:30:55,985] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
[...]
[2023-06-30 00:30:55,985] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
[2023-06-30 00:30:56,243] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6
[2023-06-30 00:30:56,243] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0
[...]
[2023-06-30 00:30:56,243] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
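The cascade above is the fp16 dynamic loss scaler settling in: every overflow skips the optimizer step and halves the scale (65536 down to 512 in the first seven iterations), and, as seen later in the log, 100 consecutive clean iterations (loss_scale_window=100 in the config) double it again. A minimal sketch of that policy, ignoring the delayed_shift/consecutive_hysteresis refinements printed in dynamic_loss_scale_args:

    class DynamicLossScaler:
        def __init__(self, init_scale=65536.0, scale_window=100, min_scale=1.0):
            self.scale = init_scale
            self.scale_window = scale_window
            self.min_scale = min_scale
            self.clean_iters = 0

        def update(self, overflow: bool) -> None:
            if overflow:
                # skip the step and halve the scale, as in the log above
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.clean_iters = 0
            else:
                self.clean_iters += 1
                if self.clean_iters % self.scale_window == 0:
                    self.scale *= 2.0   # "No Grad overflow for 100 iterations"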
[2023-06-30 00:30:57,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=7, lr=[4.999991801084829e-05, 4.999991801084829e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:30:57,180] [INFO] [timer.py:215:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=115.31334474715926, CurrSamplesPerSec=103.52727452238732, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:00,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=7, lr=[4.999846044088921e-05, 4.999846044088921e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:00,289] [INFO] [timer.py:215:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=108.20087556070622, CurrSamplesPerSec=103.64271307270093, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:03,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=7, lr=[4.9995181012051625e-05, 4.9995181012051625e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:03,387] [INFO] [timer.py:215:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=106.4669833189667, CurrSamplesPerSec=102.35712733237803, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:06,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=7, lr=[4.9990079963336504e-05, 4.9990079963336504e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:06,488] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=105.64137691879239, CurrSamplesPerSec=104.3554708243693, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:09,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[4.998315766650239e-05, 4.998315766650239e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:09,587] [INFO] [timer.py:215:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=105.18395444720493, CurrSamplesPerSec=102.723361656337, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:12,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=7, lr=[4.997441462603825e-05, 4.997441462603825e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:12,693] [INFO] [timer.py:215:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=104.83865306831603, CurrSamplesPerSec=103.25457275793309, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:15,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=7, lr=[4.996385147912677e-05, 4.996385147912677e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:15,818] [INFO] [timer.py:215:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=104.51391923296349, CurrSamplesPerSec=103.43974813975623, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:18,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=7, lr=[4.995146899559788e-05, 4.995146899559788e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:18,904] [INFO] [timer.py:215:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=104.43762455359426, CurrSamplesPerSec=104.30243399779144, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:22,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=7, lr=[4.993726807787265e-05, 4.993726807787265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:22,029] [INFO] [timer.py:215:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=104.23249007029328, CurrSamplesPerSec=100.70818667421003, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
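The logged learning rates match a cosine decay over the 3680 total steps with no warmup, advanced only on non-skipped steps: at step=10 with skipped=7 the scheduler has taken 3 steps. A quick check (assumes the HF-style cosine schedule implied by --lr_scheduler_type cosine):

    import math

    base_lr, total_steps = 5e-5, 3680
    sched_steps = 10 - 7        # global_step minus skipped steps
    lr = base_lr * 0.5 * (1 + math.cos(math.pi * sched_steps / total_steps))
    print(lr)                   # ~4.9999918e-05, matching the step=10 entry above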
[2023-06-30 00:31:25,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=7, lr=[4.9921249760897536e-05, 4.9921249760897536e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:25,186] [INFO] [timer.py:215:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=103.9552388583616, CurrSamplesPerSec=101.9083871976845, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:27,686] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:31:27,686] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0
[...]
[2023-06-30 00:31:28,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=7, lr=[4.990341521206896e-05, 4.990341521206896e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:28,352] [INFO] [timer.py:215:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=103.70494993875829, CurrSamplesPerSec=101.78102982427271, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:31,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=7, lr=[4.9883765731148184e-05, 4.9883765731148184e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:31,521] [INFO] [timer.py:215:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=103.48760810452458, CurrSamplesPerSec=101.45060340774582, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:34,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=7, lr=[4.986230275016667e-05, 4.986230275016667e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:34,675] [INFO] [timer.py:215:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=103.34378138497242, CurrSamplesPerSec=101.82241693882737, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:37,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=7, lr=[4.983902783332164e-05, 4.983902783332164e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:37,830] [INFO] [timer.py:215:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=103.21898402001713, CurrSamplesPerSec=101.81755066309215, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:40,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=7, lr=[4.98139426768621e-05, 4.98139426768621e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:40,986] [INFO] [timer.py:215:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=103.11025563829845, CurrSamplesPerSec=101.99465776649045, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:44,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=7, lr=[4.9787049108965236e-05, 4.9787049108965236e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:44,150] [INFO] [timer.py:215:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=102.9964334650231, CurrSamplesPerSec=100.69028099530674, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:47,367] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=7, lr=[4.975834908960318e-05, 4.975834908960318e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:47,383] [INFO] [timer.py:215:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=102.76413158564898, CurrSamplesPerSec=102.13280838017607, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:50,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=7, lr=[4.9727844710400125e-05, 4.9727844710400125e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:50,504] [INFO] [timer.py:215:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=102.76443672511549, CurrSamplesPerSec=104.66846082040813, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:53,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=7, lr=[4.969553819447994e-05, 4.969553819447994e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:53,603] [INFO] [timer.py:215:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=102.80278448843676, CurrSamplesPerSec=103.49981184337966, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:56,676] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=7, lr=[4.966143189630415e-05, 4.966143189630415e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:56,692] [INFO] [timer.py:215:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=102.85508288225616, CurrSamplesPerSec=103.8145878130656, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:31:59,131] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:31:59,131] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
[...]
[2023-06-30 00:31:59,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=7, lr=[4.96255283015003e-05, 4.96255283015003e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:31:59,783] [INFO] [timer.py:215:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=102.89672633512075, CurrSamplesPerSec=102.73169596659135, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:02,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=7, lr=[4.9587830026680835e-05, 4.9587830026680835e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:02,874] [INFO] [timer.py:215:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=102.93624214100747, CurrSamplesPerSec=104.37007758291543, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:05,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=7, lr=[4.954833981925243e-05, 4.954833981925243e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:05,970] [INFO] [timer.py:215:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=102.96497505231955, CurrSamplesPerSec=102.70182451479415, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:09,049] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=7, lr=[4.950706055721572e-05, 4.950706055721572e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:09,065] [INFO] [timer.py:215:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=102.9918477738346, CurrSamplesPerSec=104.25924965005522, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:12,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=7, lr=[4.9463995248955566e-05, 4.9463995248955566e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:12,150] [INFO] [timer.py:215:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=103.02985925480897, CurrSamplesPerSec=104.17492867055576, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:15,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=7, lr=[4.9419147033021814e-05, 4.9419147033021814e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:15,236] [INFO] [timer.py:215:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=103.06377383054685, CurrSamplesPerSec=104.14704054292083, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:18,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=7, lr=[4.9372519177900555e-05, 4.9372519177900555e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:18,334] [INFO] [timer.py:215:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=103.08129982020756, CurrSamplesPerSec=102.30056539892713, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:21,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=7, lr=[4.932411508177595e-05, 4.932411508177595e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:21,455] [INFO] [timer.py:215:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=103.06964606467078, CurrSamplesPerSec=102.98417307035683, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:24,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=7, lr=[4.92739382722825e-05, 4.92739382722825e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:24,552] [INFO] [timer.py:215:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=103.08511789529751, CurrSamplesPerSec=103.88369040247677, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:27,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=7, lr=[4.922199240624807e-05, 4.922199240624807e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:27,635] [INFO] [timer.py:215:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=103.11586746053922, CurrSamplesPerSec=104.6226893677789, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:30,078] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:32:30,078] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[...]
[2023-06-30 00:32:30,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=7, lr=[4.9168281269427265e-05, 4.9168281269427265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:30,731] [INFO] [timer.py:215:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=103.13049229194807, CurrSamplesPerSec=102.19424239678564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
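Throughput bookkeeping: with train_batch_size=32 and RunningAvgSamplesPerSec ~ 103, one optimizer step takes about 0.31 s, consistent with the ~3.1 s the timestamps show between successive 10-step log lines:

    samples_per_sec = 103.1   # RunningAvgSamplesPerSec from the timer lines above
    batch = 32                # train_batch_size
    print(batch / samples_per_sec * 10)   # ~3.1 s per 10 steps, matching the timestamps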
CurrSamplesPerSec=102.19424239678564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:33,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=7, lr=[4.9112808776225604e-05, 4.9112808776225604e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:32:33,817] [INFO] [timer.py:215:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=103.15439184491017, CurrSamplesPerSec=104.00862956120665, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:36,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=7, lr=[4.905557896941422e-05, 4.905557896941422e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:32:36,903] [INFO] [timer.py:215:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=103.1768444365252, CurrSamplesPerSec=104.29967820572298, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:39,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=7, lr=[4.899659601983524e-05, 4.899659601983524e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:32:39,986] [INFO] [timer.py:215:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=103.20151411081378, CurrSamplesPerSec=104.24119251084797, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:43,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=7, lr=[4.893586422609778e-05, 4.893586422609778e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:32:43,070] [INFO] [timer.py:215:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=103.2232335262338, CurrSamplesPerSec=103.5881594769103, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:46,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=7, lr=[4.887338801426473e-05, 4.887338801426473e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:32:46,154] [INFO] [timer.py:215:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=103.2434726589322, CurrSamplesPerSec=104.21998538628128, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 367 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 367 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 367 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 367 [2023-06-30 00:32:48,575] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
Attempted loss scale: 4096.0, reducing to 2048.0
[2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 367
[2023-06-30 00:32:48,575] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
(the two messages above are emitted once per rank; duplicate copies elided)
[2023-06-30 00:32:49,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=8, lr=[4.88156717067366e-05, 4.88156717067366e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:49,202] [INFO] [timer.py:215:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=103.29607859209052, CurrSamplesPerSec=101.4904175302673, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:52,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=8, lr=[4.874989374937817e-05, 4.874989374937817e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:52,310] [INFO] [timer.py:215:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=103.2931161941711, CurrSamplesPerSec=104.38387660997861, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:55,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=8, lr=[4.8682384927237355e-05, 4.8682384927237355e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:55,410] [INFO] [timer.py:215:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=103.29660493465222, CurrSamplesPerSec=103.68154349221602, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:32:58,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=8, lr=[4.861315016027902e-05, 4.861315016027902e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:32:58,491] [INFO] [timer.py:215:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=103.31595954457559, CurrSamplesPerSec=103.5284723545297, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:01,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=8, lr=[4.854219449425288e-05, 4.854219449425288e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:01,574] [INFO] [timer.py:215:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=103.33323884033565, CurrSamplesPerSec=104.54641107328966, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:04,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=8, lr=[4.84695231003258e-05, 4.84695231003258e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:04,656] [INFO] [timer.py:215:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=103.34998518177052, CurrSamplesPerSec=104.34906137122368, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:07,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=8, lr=[4.83951412747049e-05, 4.83951412747049e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:07,741] [INFO] [timer.py:215:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=103.36367155717534, CurrSamplesPerSec=104.28922373066786, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:10,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=8, lr=[4.831905443825159e-05, 4.831905443825159e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:10,826] [INFO] [timer.py:215:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=103.37696014443303, CurrSamplesPerSec=103.39424827981475, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:13,894] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=8, lr=[4.824126813608649e-05, 4.824126813608649e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:13,910] [INFO] [timer.py:215:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=103.39018169440214, CurrSamplesPerSec=104.47634297030135, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:16,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=8, lr=[4.8161788037185327e-05, 4.8161788037185327e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:17,004] [INFO] [timer.py:215:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=103.39546268759845, CurrSamplesPerSec=104.34395059943202, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:19,751] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:33:19,752] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:33:20,075] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=8, lr=[4.808061993396574e-05, 4.808061993396574e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:20,091] [INFO] [timer.py:215:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=103.40621614648363, CurrSamplesPerSec=104.30891878890182, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
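The fused_optimizer messages above show dynamic loss scaling at work: a gradient overflow at iteration 367 halves the scale (4096 → 2048) and skips that optimizer step, and 100 overflow-free iterations later the scale is doubled back (2048 → 4096). The sketch below illustrates that policy with the constants visible in this log (halve on overflow, grow after a 100-iteration window); it is a simplified illustration, not DeepSpeed's actual fused_optimizer code.

```python
# Simplified sketch of the loss-scale policy visible in this log.
# DeepSpeed's real implementation (fused_optimizer._update_scale) may differ.

class DynamicLossScaler:
    def __init__(self, init_scale: float = 4096.0,
                 scale_factor: float = 2.0, scale_window: int = 100):
        self.cur_scale = init_scale        # current loss scale
        self.scale_factor = scale_factor   # halve/double factor
        self.scale_window = scale_window   # clean iterations before growing
        self.clean_iters = 0               # iterations since last overflow

    def update_scale(self, overflow: bool) -> bool:
        """Update the scale; return True if the optimizer step should be skipped."""
        if overflow:
            # "Grad overflow on iteration N" -> halve the scale, skip the step
            self.cur_scale = max(1.0, self.cur_scale / self.scale_factor)
            self.clean_iters = 0
            return True
        self.clean_iters += 1
        if self.clean_iters == self.scale_window:
            # "No Grad overflow for 100 iterations" -> double the scale
            self.cur_scale *= self.scale_factor
            self.clean_iters = 0
        return False
```

In a mixed-precision loop the loss is multiplied by cur_scale before backward and the gradients are unscaled before the optimizer step; the skipped= counter in the log_dist lines appears to count exactly these skipped steps.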
[2023-06-30 00:33:23,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=8, lr=[4.7997769741865226e-05, 4.7997769741865226e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:23,216] [INFO] [timer.py:215:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=103.38920894699105, CurrSamplesPerSec=103.64527418353447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:26,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=8, lr=[4.791324349890993e-05, 4.791324349890993e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:26,301] [INFO] [timer.py:215:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=103.40059728715907, CurrSamplesPerSec=104.42367069964733, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:29,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=8, lr=[4.782704736527466e-05, 4.782704736527466e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:29,397] [INFO] [timer.py:215:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=103.4045579960439, CurrSamplesPerSec=102.99484172965506, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:32,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=8, lr=[4.7739187622833914e-05, 4.7739187622833914e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:32,491] [INFO] [timer.py:215:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=103.40929418376952, CurrSamplesPerSec=104.64552929453903, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:35,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=8, lr=[4.76496706747041e-05, 4.76496706747041e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:35,576] [INFO] [timer.py:215:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=103.41891074552596, CurrSamplesPerSec=103.3895491748031, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:38,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=8, lr=[4.755850304477682e-05, 4.755850304477682e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:38,660] [INFO] [timer.py:215:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=103.4298590253167, CurrSamplesPerSec=104.47723755701898, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:41,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=8, lr=[4.74656913772435e-05, 4.74656913772435e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:41,738] [INFO] [timer.py:215:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=103.44339373517894, CurrSamplesPerSec=104.63068220796148, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:44,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=8, lr=[4.737124243611111e-05, 4.737124243611111e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:44,820] [INFO] [timer.py:215:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=103.45434833997372, CurrSamplesPerSec=104.7151710056868, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:47,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=8, lr=[4.72751631047092e-05, 4.72751631047092e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:47,893] [INFO] [timer.py:215:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=103.47048588480253, CurrSamplesPerSec=104.77975191869166, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:50,635] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:33:50,635] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:33:50,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=8, lr=[4.717746038518831e-05, 4.717746038518831e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:50,976] [INFO] [timer.py:215:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=103.4795265209915, CurrSamplesPerSec=104.67490957188781, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:54,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=8, lr=[4.707814139800961e-05, 4.707814139800961e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:54,085] [INFO] [timer.py:215:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=103.47386557877749, CurrSamplesPerSec=102.97990622547165, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:33:57,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=8, lr=[4.6977213381426e-05, 4.6977213381426e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:33:57,167] [INFO] [timer.py:215:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=103.48333026458761, CurrSamplesPerSec=104.08234912766937, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:00,228] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=8, lr=[4.687468369095457e-05, 4.687468369095457e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:00,244] [INFO] [timer.py:215:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=103.4956668470901, CurrSamplesPerSec=103.79347362934539, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
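Between overflow events the lr values in the log_dist lines fall smoothly, consistent with a cosine decay schedule and no warmup. Below is a sketch under those assumptions; peak_lr and total_steps are fitted guesses for illustration, not values read from a config, and the fit works best if one assumes the scheduler advances only on non-skipped optimizer steps.

```python
import math

# Hedged sketch: a standard no-warmup cosine decay that reproduces the
# logged lr values reasonably well. peak_lr and total_steps are fitted
# assumptions, not values taken from the training configuration.

def cosine_lr(step: int, peak_lr: float = 5e-5, total_steps: int = 3687) -> float:
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# At "step=400, skipped=8" the effective (non-skipped) step would be 392:
# cosine_lr(392) ~= 4.862e-05, close to the logged 4.861315016027902e-05.
print(f"{cosine_lr(392):.6e}")
```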
[2023-06-30 00:34:03,323] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=8, lr=[4.6770559798840544e-05, 4.6770559798840544e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:03,339] [INFO] [timer.py:215:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=103.49775068206621, CurrSamplesPerSec=103.72425049420937, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:06,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=8, lr=[4.666484929351275e-05, 4.666484929351275e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:06,417] [INFO] [timer.py:215:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=103.50848911173765, CurrSamplesPerSec=104.32116102976724, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:09,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=8, lr=[4.655755987903051e-05, 4.655755987903051e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:09,505] [INFO] [timer.py:215:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=103.5142540189353, CurrSamplesPerSec=103.19090563815375, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:12,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=8, lr=[4.644869937452224e-05, 4.644869937452224e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:12,594] [INFO] [timer.py:215:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=103.51851391769982, CurrSamplesPerSec=104.05894757734389, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:15,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=8, lr=[4.6338275713615597e-05, 4.6338275713615597e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:15,679] [INFO] [timer.py:215:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=103.52502176111665, CurrSamplesPerSec=104.24524065275929, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:18,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=8, lr=[4.6226296943859225e-05, 4.6226296943859225e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:18,761] [INFO] [timer.py:215:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=103.53278538662174, CurrSamplesPerSec=104.3600147111079, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:21,510] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:34:21,511] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:34:21,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=8, lr=[4.611277122613634e-05, 4.611277122613634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:21,850] [INFO] [timer.py:215:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=103.53705547725424, CurrSamplesPerSec=104.25908767481675, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:24,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=8, lr=[4.599770683406991e-05, 4.599770683406991e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:24,954] [INFO] [timer.py:215:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=103.53409429124896, CurrSamplesPerSec=102.45651554927225, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:28,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=8, lr=[4.588111215341973e-05, 4.588111215341973e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:28,044] [INFO] [timer.py:215:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=103.53729631019901, CurrSamplesPerSec=103.98816148111648, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:29,533] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 694
[2023-06-30 00:34:29,533] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:34:29,533] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:34:30,716] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 698
[2023-06-30 00:34:30,716] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:34:30,716] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-06-30 00:34:31,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=10, lr=[4.5786740307563636e-05, 4.5786740307563636e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:31,027] [INFO] [timer.py:215:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=103.59203025170308, CurrSamplesPerSec=103.28977420082205, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:31,597] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 701
[2023-06-30 00:34:31,597] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:34:31,597] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-06-30 00:34:31,856] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 702
[2023-06-30 00:34:31,856] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:34:31,856] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
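The burst above is four overflows within about two seconds (iterations 694, 698, 701, 702), each halving the scale, so it drops 16384 → 1024; accordingly skipped rises from 8 at step=690 to 12 at step=710, one increment per skipped step. A quick check of the arithmetic:

```python
# Each overflow halves the loss scale; replaying the burst from the log:
scale = 16384.0
for iteration in (694, 698, 701, 702):   # the overflowing iterations above
    scale /= 2.0
    print(iteration, scale)              # 8192.0, 4096.0, 2048.0, 1024.0
```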
[2023-06-30 00:34:33,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=12, lr=[4.569139891489183e-05, 4.569139891489183e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:34,013] [INFO] [timer.py:215:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=103.64421969955059, CurrSamplesPerSec=104.12466001039557, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:37,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=12, lr=[4.5570865527807505e-05, 4.5570865527807505e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:37,103] [INFO] [timer.py:215:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=103.64588103910344, CurrSamplesPerSec=104.11924811318487, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:40,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=12, lr=[4.544883295984006e-05, 4.544883295984006e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:40,188] [INFO] [timer.py:215:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=103.64954159018527, CurrSamplesPerSec=102.72674239275773, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:43,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=12, lr=[4.532531010458188e-05, 4.532531010458188e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:43,266] [INFO] [timer.py:215:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=103.65666417907528, CurrSamplesPerSec=104.21723394921051, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:46,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=12, lr=[4.520030596423575e-05, 4.520030596423575e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:46,349] [INFO] [timer.py:215:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=103.66137529447003, CurrSamplesPerSec=104.35790500019827, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:49,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=12, lr=[4.507382964895884e-05, 4.507382964895884e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:49,425] [INFO] [timer.py:215:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=103.66895577694024, CurrSamplesPerSec=104.46625959783468, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:52,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=12, lr=[4.494589037619867e-05, 4.494589037619867e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:52,504] [INFO] [timer.py:215:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=103.6752039364773, CurrSamplesPerSec=104.25147540627073, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:55,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=12, lr=[4.4816497470021454e-05, 4.4816497470021454e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:55,598] [INFO] [timer.py:215:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=103.6749080603129, CurrSamplesPerSec=100.70312408557585, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:34:58,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=12, lr=[4.468566036043251e-05, 4.468566036043251e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:34:58,703] [INFO] [timer.py:215:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=103.66990462352513, CurrSamplesPerSec=103.95320709824527, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:01,771] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=12, lr=[4.455338858268903e-05, 4.455338858268903e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:01,787] [INFO] [timer.py:215:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=103.67361170836662, CurrSamplesPerSec=103.81627410683001, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:02,995] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:35:02,995] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:35:04,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=12, lr=[4.4419691776605146e-05, 4.4419691776605146e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:04,881] [INFO] [timer.py:215:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=103.6732195733665, CurrSamplesPerSec=103.86929984158542, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:07,958] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=12, lr=[4.428457968584945e-05, 4.428457968584945e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:07,973] [INFO] [timer.py:215:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=103.67335069390255, CurrSamplesPerSec=103.36606005529585, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=12, lr=[4.41480621572348e-05, 4.41480621572348e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:11,091] [INFO] [timer.py:215:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=103.66356144372286, CurrSamplesPerSec=103.41934395180772, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:14,161] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=12, lr=[4.401014914000078e-05, 4.401014914000078e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
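The timer lines offer a consistency check on throughput: consecutive step lines are roughly 3.08 s apart per 10 micro-steps. Assuming 32 samples per micro-step (an assumption about the run's global batch size, not something visible in this excerpt), the logged ~103-105 samples/s follows directly:

```python
# Back-of-the-envelope check of the samples/sec figures from the timestamps.
elapsed_s = 3.08              # ~time between the step=790 and step=800 lines
micro_steps = 10
samples_per_micro_step = 32   # assumed global batch; not stated in this excerpt
print(micro_steps * samples_per_micro_step / elapsed_s)  # ~103.9 samples/s
```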
[2023-06-30 00:35:14,177] [INFO] [timer.py:215:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=103.66608799772192, CurrSamplesPerSec=104.0730679432931, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:17,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=12, lr=[4.387085068508852e-05, 4.387085068508852e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:17,264] [INFO] [timer.py:215:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=103.66868331173174, CurrSamplesPerSec=104.40344501418828, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:20,341] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=12, lr=[4.373017694440827e-05, 4.373017694440827e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:20,357] [INFO] [timer.py:215:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=103.66858904058209, CurrSamplesPerSec=104.42269578808883, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:23,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=12, lr=[4.358813817009955e-05, 4.358813817009955e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:23,439] [INFO] [timer.py:215:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=103.67266558727816, CurrSamplesPerSec=104.23301622310063, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:26,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=12, lr=[4.344474471378389e-05, 4.344474471378389e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:26,532] [INFO] [timer.py:215:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=103.67301203941601, CurrSamplesPerSec=104.77746161334417, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:29,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=12, lr=[4.330000702581053e-05, 4.330000702581053e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:29,658] [INFO] [timer.py:215:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=103.66073898456521, CurrSamplesPerSec=103.05423146055855, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:32,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=12, lr=[4.315393565449472e-05, 4.315393565449472e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:32,760] [INFO] [timer.py:215:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=103.65747497608427, CurrSamplesPerSec=102.01946626999359, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:33,960] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:35:33,960] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:35:35,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=12, lr=[4.300654124534902e-05, 4.300654124534902e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:35,846] [INFO] [timer.py:215:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=103.65991450372336, CurrSamplesPerSec=104.04918659322685, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:38,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=12, lr=[4.2857834540307485e-05, 4.2857834540307485e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:38,931] [INFO] [timer.py:215:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=103.66306316758931, CurrSamplesPerSec=103.87950949342596, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:42,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=12, lr=[4.270782637694273e-05, 4.270782637694273e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:42,018] [INFO] [timer.py:215:stop] epoch=0/micro_step=930/global_step=930, RunningAvgSamplesPerSec=103.66518075086356, CurrSamplesPerSec=103.1348452144299, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:45,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=12, lr=[4.2556527687676186e-05, 4.2556527687676186e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:45,113] [INFO] [timer.py:215:stop] epoch=0/micro_step=940/global_step=940, RunningAvgSamplesPerSec=103.6643913753613, CurrSamplesPerSec=102.81393212878693, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:48,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=12, lr=[4.2403949498981285e-05, 4.2403949498981285e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:48,195] [INFO] [timer.py:215:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=103.66849044535626, CurrSamplesPerSec=104.1757372448148, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:51,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=12, lr=[4.2250102930579936e-05, 4.2250102930579936e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:51,282] [INFO] [timer.py:215:stop] epoch=0/micro_step=960/global_step=960, RunningAvgSamplesPerSec=103.6705890027867, CurrSamplesPerSec=104.09566850141425, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:54,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=12, lr=[4.209499919463207e-05, 4.209499919463207e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:54,363] [INFO] [timer.py:215:stop] epoch=0/micro_step=970/global_step=970, RunningAvgSamplesPerSec=103.67448439561957, CurrSamplesPerSec=104.31208041694451, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:35:57,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=12, lr=[4.193864959491853e-05, 4.193864959491853e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:35:57,450] [INFO] [timer.py:215:stop] epoch=0/micro_step=980/global_step=980, RunningAvgSamplesPerSec=103.67654072323519, CurrSamplesPerSec=103.80366837948455, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:00,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=12, lr=[4.178106552601727e-05, 4.178106552601727e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:00,595] [INFO] [timer.py:215:stop] epoch=0/micro_step=990/global_step=990, RunningAvgSamplesPerSec=103.6589761214091, CurrSamplesPerSec=101.64105756349645, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:03,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=12, lr=[4.162225847247295e-05, 4.162225847247295e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:03,681] [INFO] [timer.py:215:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=103.6614336083926, CurrSamplesPerSec=103.91473613633795, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:04,883] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:36:04,883] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:36:06,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=12, lr=[4.146224000795992e-05, 4.146224000795992e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:06,770] [INFO] [timer.py:215:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=103.66273298337852, CurrSamplesPerSec=102.83953481600898, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:09,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=12, lr=[4.130102179443877e-05, 4.130102179443877e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:09,855] [INFO] [timer.py:215:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=103.66523145581374, CurrSamplesPerSec=104.14081827422196, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:12,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=12, lr=[4.1138615581306386e-05, 4.1138615581306386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:12,943] [INFO] [timer.py:215:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=103.66699592815904, CurrSamplesPerSec=103.82486702580272, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:16,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=12, lr=[4.097503320453971e-05, 4.097503320453971e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:16,028] [INFO] [timer.py:215:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=103.66937985424632, CurrSamplesPerSec=103.5216850055302, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:19,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=12, lr=[4.081028658583314e-05, 4.081028658583314e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:19,130] [INFO] [timer.py:215:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=103.66661752052754, CurrSamplesPerSec=104.17889080436127, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:22,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=12, lr=[4.0644387731729663e-05, 4.0644387731729663e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:22,228] [INFO] [timer.py:215:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=103.66502302781356, CurrSamplesPerSec=104.24151635063085, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:25,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=12, lr=[4.047734873274586e-05, 4.047734873274586e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:25,312] [INFO] [timer.py:215:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=103.66797382755875, CurrSamplesPerSec=104.15350602528376, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:28,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=12, lr=[4.030918176249072e-05, 4.030918176249072e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:28,401] [INFO] [timer.py:215:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=103.66909012496352, CurrSamplesPerSec=104.13233456046365, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:31,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=12, lr=[4.013989907677852e-05, 4.013989907677852e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:31,516] [INFO] [timer.py:215:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=103.66232907136913, CurrSamplesPerSec=102.91073787680712, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:34,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=12, lr=[3.996951301273557e-05, 3.996951301273557e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:34,700] [INFO] [timer.py:215:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=103.63474853244514, CurrSamplesPerSec=100.62294619857437, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
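MemAllocated and MaxMemAllocated stay pinned at 4.34GB/12.81GB throughout, i.e. a steady-state allocation with a higher transient peak. A sketch of how such per-step figures can be read out in PyTorch (DeepSpeed's own reporting code may differ, but these calls expose the same quantities):

```python
import torch

# Sketch: reading the per-device allocation counters that figures like
# MemAllocated=4.34GB / MaxMemAllocated=12.81GB are derived from.

def gb(nbytes: int) -> float:
    return nbytes / (1024 ** 3)   # bytes -> GiB

if torch.cuda.is_available():
    print(f"MemAllocated={gb(torch.cuda.memory_allocated()):.2f}GB, "
          f"MaxMemAllocated={gb(torch.cuda.max_memory_allocated()):.2f}GB")
```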
[2023-06-30 00:36:35,939] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:36:35,939] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:36:37,852] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=12, lr=[3.9798035987901096e-05, 3.9798035987901096e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:37,868] [INFO] [timer.py:215:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=103.61223685992795, CurrSamplesPerSec=101.40147337580754, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:39,395] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1114
[2023-06-30 00:36:39,395] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:36:39,395] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:36:40,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=13, lr=[3.964278422125467e-05, 3.964278422125467e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:40,969] [INFO] [timer.py:215:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=103.6102283919055, CurrSamplesPerSec=100.88501604401073, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:44,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=13, lr=[3.9469268865260857e-05, 3.9469268865260857e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:44,128] [INFO] [timer.py:215:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=103.59116082226474, CurrSamplesPerSec=101.57006273506731, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:47,279] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=13, lr=[3.9294699005690305e-05, 3.9294699005690305e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:47,295] [INFO] [timer.py:215:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=103.56999320620119, CurrSamplesPerSec=101.8827819586541, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:50,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=13, lr=[3.9119087364992454e-05, 3.9119087364992454e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:50,476] [INFO] [timer.py:215:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=103.54515793566132, CurrSamplesPerSec=100.03191953189456, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:53,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=13, lr=[3.8942446741540504e-05, 3.8942446741540504e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:53,660] [INFO] [timer.py:215:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=103.51977970158758, CurrSamplesPerSec=101.81114024794034, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:56,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=13, lr=[3.876479000869877e-05, 3.876479000869877e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:36:56,833] [INFO] [timer.py:215:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=103.49792358150417, CurrSamplesPerSec=101.48665725027145, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:36:59,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=13, lr=[3.858613011388442e-05, 3.858613011388442e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:00,007] [INFO] [timer.py:215:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=103.47635176806598, CurrSamplesPerSec=100.97708228319503, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:03,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=13, lr=[3.840648007762392e-05, 3.840648007762392e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:03,219] [INFO] [timer.py:215:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=103.44479736032861, CurrSamplesPerSec=95.24749937373461, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:06,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=13, lr=[3.822585299260408e-05, 3.822585299260408e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:06,342] [INFO] [timer.py:215:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=103.43844835421754, CurrSamplesPerSec=104.02039832535328, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:09,419] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=13, lr=[3.8044262022717925e-05, 3.8044262022717925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:09,435] [INFO] [timer.py:215:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=103.44026684760018, CurrSamplesPerSec=104.21286431833057, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:11,260] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:37:11,261] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
(the two messages above are emitted once per rank, 8 ranks in total; duplicate copies elided)
[2023-06-30 00:37:12,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=13, lr=[3.786172040210525e-05, 3.786172040210525e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:12,537] [INFO] [timer.py:215:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=103.43976875689424, CurrSamplesPerSec=102.72367613405889, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:15,628] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=13, lr=[3.767824143418822e-05, 3.767824143418822e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:15,644] [INFO] [timer.py:215:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=103.43773207855115, CurrSamplesPerSec=102.84394764386673, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:18,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=13, lr=[3.749383849070175e-05, 3.749383849070175e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:18,746] [INFO] [timer.py:215:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=103.4372662028141, CurrSamplesPerSec=104.09655658087203, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:21,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=13, lr=[3.730852501071905e-05, 3.730852501071905e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:21,859] [INFO] [timer.py:215:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=103.43376830296017, CurrSamplesPerSec=103.09982847130524, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:24,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=13, lr=[3.712231449967218e-05, 3.712231449967218e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:24,985] [INFO] [timer.py:215:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=103.42690860694216, CurrSamplesPerSec=102.16522700215873, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:28,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=13, lr=[3.693522052836776e-05, 3.693522052836776e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:28,095] [INFO] [timer.py:215:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=103.42428566941705, CurrSamplesPerSec=102.6646666783953, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:31,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=13, lr=[3.674725673199799e-05, 3.674725673199799e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:31,203] [INFO] [timer.py:215:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=103.42229492887722, CurrSamplesPerSec=102.7960571966883, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:34,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=13, lr=[3.6558436809146916e-05, 3.6558436809146916e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:34,315] [INFO] [timer.py:215:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=103.41929427048527, CurrSamplesPerSec=101.71585163269513, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
from 16384.0 to 8192.0 [2023-06-30 00:37:35,821] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:37:35,821] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:37:35,821] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:37:35,821] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0 [2023-06-30 00:37:35,821] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1294 [2023-06-30 00:37:35,822] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:37:35,821] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1294 [2023-06-30 00:37:35,822] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:37:37,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=14, lr=[3.6387778262614316e-05, 3.6387778262614316e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:37,371] [INFO] [timer.py:215:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=103.43070247770949, CurrSamplesPerSec=103.53086810197138, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:40,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=14, lr=[3.619736966170205e-05, 3.619736966170205e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:40,470] [INFO] [timer.py:215:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=103.43104950214516, CurrSamplesPerSec=104.1128676436366, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:43,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=14, lr=[3.600614500944205e-05, 3.600614500944205e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:43,567] [INFO] [timer.py:215:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=103.43181435123343, CurrSamplesPerSec=103.40436475193549, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:46,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=14, lr=[3.5814118242065756e-05, 3.5814118242065756e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:46,668] [INFO] [timer.py:215:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=103.43151492663492, CurrSamplesPerSec=103.3436955967695, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:49,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=14, lr=[3.562130335426184e-05, 3.562130335426184e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:49,775] [INFO] [timer.py:215:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=103.42992747643464, CurrSamplesPerSec=103.19661818142676, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:52,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=14, lr=[3.5427714398156264e-05, 3.5427714398156264e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:37:52,882] [INFO] [timer.py:215:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=103.4282848705966, CurrSamplesPerSec=103.50779367806696, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:37:55,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=14, lr=[3.5233365482288225e-05, 3.5233365482288225e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 
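NOTE: The fused_optimizer messages above are fp16 dynamic loss scaling reacting to a gradient overflow. With 8 local ranks, every rank's fused optimizer detects the same overflow and prints the same pair of messages at the same timestamp, so a single event shows up many times in the combined console output. The effect is that the loss scale is halved and the optimizer step is skipped, which is why the skipped counter goes from 13 to 14 at step 1300. Below is a minimal sketch of that policy, with illustrative names and the constants visible in this log; it is not DeepSpeed's actual FusedOptimizer code.

    # Hypothetical sketch of the dynamic loss scaling policy this log shows.
    class DynamicLossScaler:
        def __init__(self, init_scale=16384.0, factor=2.0, window=100):
            self.scale = init_scale   # current loss scale
            self.factor = factor      # halve/double on events
            self.window = window      # "No Grad overflow for 100 iterations"
            self.clean_iters = 0

        def update(self, has_overflow: bool) -> bool:
            """Return True if this optimizer step should be skipped."""
            if has_overflow:
                # "Reducing dynamic loss scale from 16384.0 to 8192.0"
                self.scale = max(self.scale / self.factor, 1.0)
                self.clean_iters = 0
                return True
            self.clean_iters += 1
            if self.clean_iters >= self.window:
                # "Increasing dynamic loss scale from 8192.0 to 16384.0"
                self.scale *= self.factor
                self.clean_iters = 0
            return False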
[2023-06-30 00:37:55,987] [INFO] [timer.py:215:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=103.42711518532879, CurrSamplesPerSec=103.23821186113449, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:37:59,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=14, lr=[3.5038270770581885e-05, 3.5038270770581885e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:37:59,095] [INFO] [timer.py:215:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=103.42519722528172, CurrSamplesPerSec=103.07362109243607, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:02,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=14, lr=[3.4842444481314116e-05, 3.4842444481314116e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:02,197] [INFO] [timer.py:215:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=103.42481959162731, CurrSamplesPerSec=103.77196686848757, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:05,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=14, lr=[3.464590088607839e-05, 3.464590088607839e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:05,300] [INFO] [timer.py:215:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=103.42436931383565, CurrSamplesPerSec=103.74076581799069, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:07,132] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:38:07,132] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:38:08,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=14, lr=[3.444865430874453e-05, 3.444865430874453e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:08,404] [INFO] [timer.py:215:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=103.42344600150899, CurrSamplesPerSec=103.30209640152886, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:11,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=14, lr=[3.425071912441493e-05, 3.425071912441493e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:11,492] [INFO] [timer.py:215:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=103.42635958720163, CurrSamplesPerSec=103.86351258385561, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:14,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=14, lr=[3.405210975837685e-05, 3.405210975837685e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:14,579] [INFO] [timer.py:215:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=103.42955993475982, CurrSamplesPerSec=104.0776679756078, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:17,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=14, lr=[3.385284068505113e-05, 3.385284068505113e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:17,669] [INFO] [timer.py:215:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=103.43184355335282, CurrSamplesPerSec=102.97050461157936, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:20,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=14, lr=[3.365292642693732e-05, 3.365292642693732e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:20,763] [INFO] [timer.py:215:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=103.4334536692826, CurrSamplesPerSec=103.61654897474082, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:23,833] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=14, lr=[3.34523815535553e-05, 3.34523815535553e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:23,849] [INFO] [timer.py:215:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=103.43658026417592, CurrSamplesPerSec=104.19724992857763, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:26,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=14, lr=[3.3251220680383436e-05, 3.3251220680383436e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:26,936] [INFO] [timer.py:215:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=103.4395497706907, CurrSamplesPerSec=103.10838240321422, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:29,048] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1466
[2023-06-30 00:38:29,048] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-30 00:38:29,048] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:38:29,958] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=15, lr=[3.306966133059528e-05, 3.306966133059528e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:29,974] [INFO] [timer.py:215:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=103.453641143058, CurrSamplesPerSec=104.02410684391315, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:33,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=15, lr=[3.286737048339026e-05, 3.286737048339026e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:33,068] [INFO] [timer.py:215:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=103.45479721164662, CurrSamplesPerSec=103.78705276743095, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:36,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=15, lr=[3.2664506271325465e-05, 3.2664506271325465e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:36,186] [INFO] [timer.py:215:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=103.45092080321385, CurrSamplesPerSec=102.35743957171128, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:39,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=15, lr=[3.246108347890996e-05, 3.246108347890996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:39,294] [INFO] [timer.py:215:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=103.44895023744468, CurrSamplesPerSec=104.22370815004058, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:42,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=15, lr=[3.225711693136156e-05, 3.225711693136156e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:42,376] [INFO] [timer.py:215:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=103.45300057509091, CurrSamplesPerSec=104.1757372448148, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:45,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=15, lr=[3.20526214935263e-05, 3.20526214935263e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:45,458] [INFO] [timer.py:215:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=103.45681866736066, CurrSamplesPerSec=104.2773130462279, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:48,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=15, lr=[3.184761206879511e-05, 3.184761206879511e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:48,548] [INFO] [timer.py:215:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=103.45872937010203, CurrSamplesPerSec=103.56426038881622, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:51,625] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=15, lr=[3.164210359801773e-05, 3.164210359801773e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:51,641] [INFO] [timer.py:215:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=103.46017946278853, CurrSamplesPerSec=103.85676159819242, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:54,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=15, lr=[3.1436111058413756e-05, 3.1436111058413756e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:54,739] [INFO] [timer.py:215:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=103.46043775839098, CurrSamplesPerSec=104.24386424923264, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:38:57,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=15, lr=[3.122964946248119e-05, 3.122964946248119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:38:57,823] [INFO] [timer.py:215:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=103.46347952309975, CurrSamplesPerSec=103.94894008545612, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:00,260] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:39:00,260] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:39:00,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=15, lr=[3.102273385690231e-05, 3.102273385690231e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:00,912] [INFO] [timer.py:215:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=103.4656372992591, CurrSamplesPerSec=103.6520777488692, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:03,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=15, lr=[3.08153793214471e-05, 3.08153793214471e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:04,007] [INFO] [timer.py:215:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=103.466692141174, CurrSamplesPerSec=103.26720437511973, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:07,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=15, lr=[3.0607600967874206e-05, 3.0607600967874206e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:07,114] [INFO] [timer.py:215:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=103.46509850201409, CurrSamplesPerSec=103.97688330562532, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:10,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=15, lr=[3.039941393882969e-05, 3.039941393882969e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:10,227] [INFO] [timer.py:215:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=103.46222496589522, CurrSamplesPerSec=103.63775128062971, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:13,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=15, lr=[3.0190833406743398e-05, 3.0190833406743398e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:13,311] [INFO] [timer.py:215:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=103.4653660352398, CurrSamplesPerSec=103.8192453291378, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:16,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=15, lr=[2.9981874572723222e-05, 2.9981874572723222e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:16,393] [INFO] [timer.py:215:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=103.46893127514939, CurrSamplesPerSec=103.59199714735135, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:19,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=15, lr=[2.9772552665447263e-05, 2.9772552665447263e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:19,473] [INFO] [timer.py:215:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=103.47275376158888, CurrSamplesPerSec=104.30243399779144, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:22,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=15, lr=[2.9562882940053975e-05, 2.9562882940053975e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:22,559] [INFO] [timer.py:215:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=103.47543296471976, CurrSamplesPerSec=104.28841339377429, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:25,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=15, lr=[2.9352880677030386e-05, 2.9352880677030386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:25,645] [INFO] [timer.py:215:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=103.47811212696614, CurrSamplesPerSec=104.23455423868735, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:28,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=15, lr=[2.9142561181098505e-05, 2.9142561181098505e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:28,731] [INFO] [timer.py:215:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=103.4805831332416, CurrSamplesPerSec=103.07298784636627, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
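NOTE: The lr column decays smoothly across these steps (about 3.75e-05 at step 1240 down to about 2.91e-05 by step 1660), consistent with a cosine decay schedule with no warmup. A hedged sketch of the standard warmup-plus-cosine rule; total_steps and warmup_steps here are illustrative parameters, not values read from this log:

    import math

    def cosine_lr(step, peak_lr, total_steps, warmup_steps=0):
        # Linear warmup, then cosine decay from peak_lr toward 0.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

Note that the schedule advances with the step counter whether or not the optimizer update was applied, which is why the lr keeps decaying through skipped steps.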
[2023-06-30 00:39:31,171] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:39:31,171] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:39:31,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=15, lr=[2.89319397800999e-05, 2.89319397800999e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:31,822] [INFO] [timer.py:215:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=103.48201549321435, CurrSamplesPerSec=103.58488169189961, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:33,329] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1674
[2023-06-30 00:39:33,329] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:39:33,330] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:39:34,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=16, lr=[2.874213507657861e-05, 2.874213507657861e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:34,871] [INFO] [timer.py:215:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=103.49193852728861, CurrSamplesPerSec=103.67930094465477, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:37,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=16, lr=[2.8530982362057202e-05, 2.8530982362057202e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:37,960] [INFO] [timer.py:215:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=103.49383450130455, CurrSamplesPerSec=103.81362424122419, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:41,042] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=16, lr=[2.8319572313625908e-05, 2.8319572313625908e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:41,058] [INFO] [timer.py:215:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=103.49398339042415, CurrSamplesPerSec=103.11796766119082, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:44,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=16, lr=[2.8107920338604514e-05, 2.8107920338604514e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:44,170] [INFO] [timer.py:215:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=103.49192781251041, CurrSamplesPerSec=103.7699610952702, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:47,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=16, lr=[2.7896041861944113e-05, 2.7896041861944113e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:47,252] [INFO] [timer.py:215:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=103.49517885548333, CurrSamplesPerSec=103.83257777131381, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:50,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=16, lr=[2.7683952325102965e-05, 2.7683952325102965e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:50,337] [INFO] [timer.py:215:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=103.49754505126633, CurrSamplesPerSec=102.82960733778512, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:53,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=16, lr=[2.747166718492119e-05, 2.747166718492119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:53,418] [INFO] [timer.py:215:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=103.50089230416906, CurrSamplesPerSec=104.41847138168157, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:56,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=16, lr=[2.725920191249422e-05, 2.725920191249422e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:56,502] [INFO] [timer.py:215:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=103.50376046021381, CurrSamplesPerSec=103.92455237817965, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:39:59,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=16, lr=[2.7046571992045334e-05, 2.7046571992045334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:39:59,585] [INFO] [timer.py:215:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=103.50645220791056, CurrSamplesPerSec=104.45601558383746, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:02,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=16, lr=[2.6833792919797152e-05, 2.6833792919797152e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:02,671] [INFO] [timer.py:215:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=103.50867487041802, CurrSamplesPerSec=103.71158696404727, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:04,496] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:40:04,496] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:40:05,104] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1777
[2023-06-30 00:40:05,104] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:40:05,105] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:40:05,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=17, lr=[2.6642177046391077e-05, 2.6642177046391077e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:05,724] [INFO] [timer.py:215:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=103.51722419938098, CurrSamplesPerSec=103.17020296940278, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:08,803] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=17, lr=[2.6429157315837844e-05, 2.6429157315837844e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:08,819] [INFO] [timer.py:215:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=103.5176423157018, CurrSamplesPerSec=103.21312463760984, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:11,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=17, lr=[2.6216033429952692e-05, 2.6216033429952692e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:11,927] [INFO] [timer.py:215:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=103.51574890320559, CurrSamplesPerSec=99.88407535712027, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:15,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=17, lr=[2.6002820920957876e-05, 2.6002820920957876e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:15,025] [INFO] [timer.py:215:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=103.5158044748933, CurrSamplesPerSec=103.21629956150402, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:18,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=17, lr=[2.578953532753442e-05, 2.578953532753442e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:18,141] [INFO] [timer.py:215:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=103.51238761082102, CurrSamplesPerSec=103.66176436514397, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:21,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=17, lr=[2.5576192193689634e-05, 2.5576192193689634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:21,220] [INFO] [timer.py:215:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=103.51583535504373, CurrSamplesPerSec=103.99517129430542, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:24,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=17, lr=[2.536280706762431e-05, 2.536280706762431e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:24,301] [INFO] [timer.py:215:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=103.51884308106756, CurrSamplesPerSec=104.16094232248221, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:27,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=17, lr=[2.5149395500599606e-05, 2.5149395500599606e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:27,380] [INFO] [timer.py:215:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=103.52216805837585, CurrSamplesPerSec=103.52400058928383, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:30,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=17, lr=[2.493597304580363e-05, 2.493597304580363e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:30,459] [INFO] [timer.py:215:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=103.52551607593473, CurrSamplesPerSec=104.1466364769255, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:33,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=17, lr=[2.472255525721801e-05, 2.472255525721801e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:33,545] [INFO] [timer.py:215:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=103.5275458490315, CurrSamplesPerSec=103.7377189063686, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:36,292] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:40:36,292] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:40:36,620] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=17, lr=[2.4509157688484295e-05, 2.4509157688484295e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:36,636] [INFO] [timer.py:215:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=103.52863572448355, CurrSamplesPerSec=102.55688060425685, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:38,445] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1885
[2023-06-30 00:40:38,445] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:40:38,446] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:40:39,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=18, lr=[2.431713001851286e-05, 2.431713001851286e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:39,678] [INFO] [timer.py:215:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=103.53839694394475, CurrSamplesPerSec=104.05910893130991, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:42,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=18, lr=[2.410381371158917e-05, 2.410381371158917e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:42,763] [INFO] [timer.py:215:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=103.54056650012203, CurrSamplesPerSec=103.63799135640505, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:45,860] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=18, lr=[2.389056271768547e-05, 2.389056271768547e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:45,876] [INFO] [timer.py:215:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=103.5380232323054, CurrSamplesPerSec=103.74918584652251, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=18, lr=[2.3677392578287495e-05, 2.3677392578287495e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:48,969] [INFO] [timer.py:215:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=103.53865184043329, CurrSamplesPerSec=103.92495472273434, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:52,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=18, lr=[2.3464318828988416e-05, 2.3464318828988416e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:52,086] [INFO] [timer.py:215:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=103.53506693337205, CurrSamplesPerSec=103.20074430048825, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:55,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=18, lr=[2.3251356998356595e-05, 2.3251356998356595e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:55,179] [INFO] [timer.py:215:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=103.5358392813737, CurrSamplesPerSec=103.57161277975882, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:40:58,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=18, lr=[2.303852260680388e-05, 2.303852260680388e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:40:58,268] [INFO] [timer.py:215:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=103.5376059366276, CurrSamplesPerSec=103.58600091223906, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:01,339] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=18, lr=[2.2825831165454533e-05, 2.2825831165454533e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:01,355] [INFO] [timer.py:215:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=103.53927097964268, CurrSamplesPerSec=103.26235791760722, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:04,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=18, lr=[2.261329817501475e-05, 2.261329817501475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:04,447] [INFO] [timer.py:215:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=103.54016127505473, CurrSamplesPerSec=103.64647474781016, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
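NOTE: In the timer lines, CurrSamplesPerSec is the throughput of the most recent step while RunningAvgSamplesPerSec averages over the run so far; MemAllocated and MaxMemAllocated are current and peak CUDA memory, which typically come from torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated(). A rough sketch of how such counters can be derived; this is illustrative only, not DeepSpeed's ThroughputTimer code, and samples_per_step is a placeholder parameter:

    import time
    import torch

    class ThroughputMeter:
        def __init__(self, samples_per_step):
            self.samples = samples_per_step  # global batch size per step
            self.total_time = 0.0
            self.steps = 0

        def timed_step(self, train_step_fn):
            start = time.time()
            train_step_fn()                  # run one training step
            elapsed = time.time() - start
            self.total_time += elapsed
            self.steps += 1
            curr = self.samples / elapsed                      # CurrSamplesPerSec
            avg = self.samples * self.steps / self.total_time  # RunningAvgSamplesPerSec
            peak_gb = torch.cuda.max_memory_allocated() / 2**30
            return curr, avg, peak_gb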
[2023-06-30 00:41:07,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=18, lr=[2.240093912464302e-05, 2.240093912464302e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:07,544] [INFO] [timer.py:215:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=103.54014857198324, CurrSamplesPerSec=102.9987146046239, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:09,670] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:41:09,670] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:41:10,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=18, lr=[2.218876949082127e-05, 2.218876949082127e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:10,629] [INFO] [timer.py:215:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=103.54212397516946, CurrSamplesPerSec=103.32762463306986, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:12,134] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1994
[2023-06-30 00:41:12,134] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:41:12,135] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:41:13,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=19, lr=[2.1997991551738255e-05, 2.1997991551738255e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:13,684] [INFO] [timer.py:215:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=103.54923294581425, CurrSamplesPerSec=103.47898505303947, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:16,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=19, lr=[2.1786224396720407e-05, 2.1786224396720407e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:16,796] [INFO] [timer.py:215:stop] epoch=0/micro_step=2010/global_step=2010, RunningAvgSamplesPerSec=103.54667068252671, CurrSamplesPerSec=103.7667520193961, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:19,890] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=19, lr=[2.1574691457950803e-05, 2.1574691457950803e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:19,906] [INFO] [timer.py:215:stop] epoch=0/micro_step=2020/global_step=2020, RunningAvgSamplesPerSec=103.54463733074316, CurrSamplesPerSec=104.02378435354115, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:22,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=19, lr=[2.1363408151705317e-05, 2.1363408151705317e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:23,008] [INFO] [timer.py:215:stop] epoch=0/micro_step=2030/global_step=2030, RunningAvgSamplesPerSec=103.543657377976, CurrSamplesPerSec=102.65116140336654, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:26,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=19, lr=[2.11523898760669e-05, 2.11523898760669e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:26,114] [INFO] [timer.py:215:stop] epoch=0/micro_step=2040/global_step=2040, RunningAvgSamplesPerSec=103.54217490868281, CurrSamplesPerSec=103.348310955707, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:29,198] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=19, lr=[2.0941652009803365e-05, 2.0941652009803365e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:29,214] [INFO] [timer.py:215:stop] epoch=0/micro_step=2050/global_step=2050, RunningAvgSamplesPerSec=103.5417656286957, CurrSamplesPerSec=103.62070874305844, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:32,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=19, lr=[2.0731209911246627e-05, 2.0731209911246627e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:32,307] [INFO] [timer.py:215:stop] epoch=0/micro_step=2060/global_step=2060, RunningAvgSamplesPerSec=103.5423457512477, CurrSamplesPerSec=103.52088655239224, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:35,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=19, lr=[2.052107891717339e-05, 2.052107891717339e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:35,407] [INFO] [timer.py:215:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=103.54194300103738, CurrSamplesPerSec=103.56394074332421, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:38,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=19, lr=[2.0311274341687408e-05, 2.0311274341687408e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:38,508] [INFO] [timer.py:215:stop] epoch=0/micro_step=2080/global_step=2080, RunningAvgSamplesPerSec=103.54118434098626, CurrSamplesPerSec=103.33024974594278, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:41,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=19, lr=[2.0101811475103458e-05, 2.0101811475103458e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:41:41,601] [INFO] [timer.py:215:stop] epoch=0/micro_step=2090/global_step=2090, RunningAvgSamplesPerSec=103.54182619691316, CurrSamplesPerSec=103.56473986075392, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:41:43,440] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:41:43,440] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
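NOTE: By this point the scale is oscillating between 16384.0 and 32768.0: after each run of 100 overflow-free iterations the scaler doubles, overflows again within a few steps to a few dozen, and halves back. This sawtooth is the scaler probing for the largest usable scale, and the cost of the skipped updates stays small. A quick back-of-the-envelope check from the counters in this log:

    # Skipped-update overhead between two points in this log:
    # step=1240 reports skipped=13; step=2090 reports skipped=19.
    skips = 19 - 13
    steps = 2090 - 1240
    print(f"{skips} skipped steps over {steps} steps = "
          f"{100 * skips / steps:.2f}% of updates lost")  # ~0.71%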
lr=[1.9892705582832933e-05, 1.9892705582832933e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:41:44,728] [INFO] [timer.py:215:stop] epoch=0/micro_step=2100/global_step=2100, RunningAvgSamplesPerSec=103.53704487617583, CurrSamplesPerSec=100.59338359839911, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:41:47,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=19, lr=[1.9683971904271375e-05, 1.9683971904271375e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:41:47,866] [INFO] [timer.py:215:stop] epoch=0/micro_step=2110/global_step=2110, RunningAvgSamplesPerSec=103.53069822906667, CurrSamplesPerSec=103.60495138466479, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:41:50,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=19, lr=[1.947562565168781e-05, 1.947562565168781e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:41:50,972] [INFO] [timer.py:215:stop] epoch=0/micro_step=2120/global_step=2120, RunningAvgSamplesPerSec=103.5293844900595, CurrSamplesPerSec=103.64839570850384, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:52,792] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
Attempted loss scale: 32768.0, reducing to 16384.0 [2023-06-30 00:41:52,792] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2125 [2023-06-30 00:41:52,793] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-06-30 00:41:54,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=20, lr=[1.9288457823779026e-05, 1.9288457823779026e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:41:54,042] [INFO] [timer.py:215:stop] epoch=0/micro_step=2130/global_step=2130, RunningAvgSamplesPerSec=103.53371514765188, CurrSamplesPerSec=103.11614553282571, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:41:57,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=20, lr=[1.9080889488338833e-05, 1.9080889488338833e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:41:57,147] [INFO] [timer.py:215:stop] epoch=0/micro_step=2140/global_step=2140, RunningAvgSamplesPerSec=103.53249196184484, CurrSamplesPerSec=102.62000290539869, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:00,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=20, lr=[1.887375253082546e-05, 1.887375253082546e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:00,257] [INFO] [timer.py:215:stop] epoch=0/micro_step=2150/global_step=2150, RunningAvgSamplesPerSec=103.5304521842314, CurrSamplesPerSec=103.72665531130889, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:03,353] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=20, lr=[1.866706204714074e-05, 1.866706204714074e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:03,369] [INFO] [timer.py:215:stop] epoch=0/micro_step=2160/global_step=2160, RunningAvgSamplesPerSec=103.52808817192113, CurrSamplesPerSec=102.98053832832949, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:06,453] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=20, lr=[1.846083310064804e-05, 1.846083310064804e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:06,469] [INFO] [timer.py:215:stop] epoch=0/micro_step=2170/global_step=2170, RunningAvgSamplesPerSec=103.52774204932862, CurrSamplesPerSec=103.29899570003532, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:09,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=20, lr=[1.825508072107439e-05, 1.825508072107439e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:09,601] [INFO] [timer.py:215:stop] epoch=0/micro_step=2180/global_step=2180, RunningAvgSamplesPerSec=103.52247045641272, CurrSamplesPerSec=101.12642976672416, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:12,749] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=20, lr=[1.8049819903415228e-05, 1.8049819903415228e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:12,765] [INFO] [timer.py:215:stop] epoch=0/micro_step=2190/global_step=2190, RunningAvgSamplesPerSec=103.51224733267122, CurrSamplesPerSec=100.81166556880088, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:15,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=20, lr=[1.7845065606841472e-05, 1.7845065606841472e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:42:15,943] [INFO] [timer.py:215:stop] epoch=0/micro_step=2200/global_step=2200, RunningAvgSamplesPerSec=103.4999381704744, CurrSamplesPerSec=100.83400797550253, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:42:19,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=20, lr=[1.76408327536094e-05, 
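The episode above, "No Grad overflow for 100 iterations" followed a few steps later by an overflow at iteration 2125, a halved scale, and a skipped step (the skipped counter ticks from 19 to 20 at step 2130), is dynamic fp16 loss scaling at work. A minimal sketch of the update rule these messages correspond to, assuming a growth window of 100 iterations and a factor of 2; the class and attribute names are illustrative, not DeepSpeed's actual implementation:

class DynamicLossScaler:
    # Illustrative update rule, not DeepSpeed's code: halve on overflow
    # (and skip the step), double after scale_window consecutive
    # overflow-free iterations.
    def __init__(self, init_scale=16384.0, scale_factor=2.0,
                 scale_window=100, min_scale=1.0):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.clean_iters = 0  # iterations since the last overflow

    def update(self, has_overflow):
        # Returns True when the optimizer step should be skipped.
        if has_overflow:
            # "Reducing dynamic loss scale" / "Overflow detected. Skipping step."
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.clean_iters = 0
            return True
        self.clean_iters += 1
        if self.clean_iters % self.scale_window == 0:
            # "No Grad overflow for 100 iterations" -> double the scale
            self.cur_scale *= self.scale_factor
        return False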
[2023-06-30 00:42:19,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=20, lr=[1.76408327536094e-05, 1.76408327536094e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:19,116] [INFO] [timer.py:215:stop] epoch=0/micro_step=2210/global_step=2210, RunningAvgSamplesPerSec=103.48875213270716, CurrSamplesPerSec=101.88680368201548, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:22,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=20, lr=[1.743713622797311e-05, 1.743713622797311e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:22,292] [INFO] [timer.py:215:stop] epoch=0/micro_step=2220/global_step=2220, RunningAvgSamplesPerSec=103.47697393644144, CurrSamplesPerSec=102.18568386463632, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:24,477] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:42:24,477] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:42:25,417] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2229
[2023-06-30 00:42:25,417] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:42:25,417] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:42:25,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=21, lr=[1.725428018535056e-05, 1.725428018535056e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:25,419] [INFO] [timer.py:215:stop] epoch=0/micro_step=2230/global_step=2230, RunningAvgSamplesPerSec=103.47276429999418, CurrSamplesPerSec=115.47697046348158, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:28,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=21, lr=[1.7051643547404494e-05, 1.7051643547404494e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:28,587] [INFO] [timer.py:215:stop] epoch=0/micro_step=2240/global_step=2240, RunningAvgSamplesPerSec=103.4623340878375, CurrSamplesPerSec=100.79531266075743, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:31,735] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=21, lr=[1.6849586176481978e-05, 1.6849586176481978e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:31,751] [INFO] [timer.py:215:stop] epoch=0/micro_step=2250/global_step=2250, RunningAvgSamplesPerSec=103.45277386053641, CurrSamplesPerSec=101.55630598762569, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:34,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=21, lr=[1.6648122798290454e-05, 1.6648122798290454e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:34,916] [INFO] [timer.py:215:stop] epoch=0/micro_step=2260/global_step=2260, RunningAvgSamplesPerSec=103.44322732111681, CurrSamplesPerSec=101.41028415307153, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:38,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=21, lr=[1.6447268095247876e-05, 1.6447268095247876e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:38,083] [INFO] [timer.py:215:stop] epoch=0/micro_step=2270/global_step=2270, RunningAvgSamplesPerSec=103.43327066132912, CurrSamplesPerSec=100.02342116892709, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:41,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=21, lr=[1.6247036705412644e-05, 1.6247036705412644e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:41,196] [INFO] [timer.py:215:stop] epoch=0/micro_step=2280/global_step=2280, RunningAvgSamplesPerSec=103.43134624507731, CurrSamplesPerSec=103.84430672577737, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:43,010] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2285
[2023-06-30 00:42:43,010] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-30 00:42:43,011] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:42:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=22, lr=[1.6067373449119387e-05, 1.6067373449119387e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:44,260] [INFO] [timer.py:215:stop] epoch=0/micro_step=2290/global_step=2290, RunningAvgSamplesPerSec=103.43667662283369, CurrSamplesPerSec=100.62023054132831, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:47,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=22, lr=[1.5868366518677517e-05, 1.5868366518677517e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:47,352] [INFO] [timer.py:215:stop] epoch=0/micro_step=2300/global_step=2300, RunningAvgSamplesPerSec=103.43789217763306, CurrSamplesPerSec=104.0400726168901, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:50,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=22, lr=[1.567002509112022e-05, 1.567002509112022e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:50,465] [INFO] [timer.py:215:stop] epoch=0/micro_step=2310/global_step=2310, RunningAvgSamplesPerSec=103.43590284897634, CurrSamplesPerSec=103.19701490930703, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:53,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=22, lr=[1.5472363621341286e-05, 1.5472363621341286e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:53,565] [INFO] [timer.py:215:stop] epoch=0/micro_step=2320/global_step=2320, RunningAvgSamplesPerSec=103.43596142616832, CurrSamplesPerSec=103.01768799587984, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:42:56,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=22, lr=[1.5275396514679986e-05, 1.5275396514679986e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:56,663] [INFO] [timer.py:215:stop] epoch=0/micro_step=2330/global_step=2330, RunningAvgSamplesPerSec=103.43633212455562, CurrSamplesPerSec=103.42022052874502, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
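The test behind each "Grad overflow on iteration N" message is an inf/NaN check on the fp16 gradients after backward; every rank runs it, which is why the message is emitted once per GPU. A hedged sketch of such a check in PyTorch; the helper name and the commented step skeleton are assumptions, not DeepSpeed's API:

import torch

def gradients_overflowed(params) -> bool:
    # Hypothetical helper: True if any gradient contains inf/NaN, the
    # condition that triggers "Grad overflow on iteration N" above.
    for p in params:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return True
    return False

# Step skeleton under that assumption: scale the loss up before backward
# so small gradients survive fp16, then skip the update on overflow.
# (loss * scaler.cur_scale).backward()
# if gradients_overflowed(model.parameters()):
#     optimizer.zero_grad()   # drop the bad gradients; "Skipping step."
# else:
#     for p in model.parameters():
#         if p.grad is not None:
#             p.grad.div_(scaler.cur_scale)   # unscale before the real step
#     optimizer.step()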
[2023-06-30 00:42:59,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=22, lr=[1.5079138125871195e-05, 1.5079138125871195e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:42:59,752] [INFO] [timer.py:215:stop] epoch=0/micro_step=2340/global_step=2340, RunningAvgSamplesPerSec=103.43787261539713, CurrSamplesPerSec=104.01644823772756, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:02,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=22, lr=[1.4883602757999259e-05, 1.4883602757999259e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:02,844] [INFO] [timer.py:215:stop] epoch=0/micro_step=2350/global_step=2350, RunningAvgSamplesPerSec=103.43903664396422, CurrSamplesPerSec=103.70205128323995, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:05,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=22, lr=[1.468880466145559e-05, 1.468880466145559e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:05,936] [INFO] [timer.py:215:stop] epoch=0/micro_step=2360/global_step=2360, RunningAvgSamplesPerSec=103.44019319197177, CurrSamplesPerSec=103.99629939849729, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:09,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=22, lr=[1.4494758032900119e-05, 1.4494758032900119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:09,038] [INFO] [timer.py:215:stop] epoch=0/micro_step=2370/global_step=2370, RunningAvgSamplesPerSec=103.43994942536546, CurrSamplesPerSec=103.6760974735572, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:12,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=22, lr=[1.4301477014226664e-05, 1.4301477014226664e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:12,132] [INFO] [timer.py:215:stop] epoch=0/micro_step=2380/global_step=2380, RunningAvgSamplesPerSec=103.4407976463293, CurrSamplesPerSec=104.03015390863763, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:14,269] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:43:14,269] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:43:15,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=22, lr=[1.4108975691532272e-05, 1.4108975691532272e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:15,234] [INFO] [timer.py:215:stop] epoch=0/micro_step=2390/global_step=2390, RunningAvgSamplesPerSec=103.44052527800905, CurrSamplesPerSec=101.98659760051609, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:18,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=22, lr=[1.3917268094090663e-05, 1.3917268094090663e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:18,334] [INFO] [timer.py:215:stop] epoch=0/micro_step=2400/global_step=2400, RunningAvgSamplesPerSec=103.44049843348043, CurrSamplesPerSec=102.57788660908696, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:21,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=22, lr=[1.3726368193329758e-05, 1.3726368193329758e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:21,434] [INFO] [timer.py:215:stop] epoch=0/micro_step=2410/global_step=2410, RunningAvgSamplesPerSec=103.44061801089195, CurrSamplesPerSec=103.93380709119026, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:24,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=22, lr=[1.3536289901813486e-05, 1.3536289901813486e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:24,559] [INFO] [timer.py:215:stop] epoch=0/micro_step=2420/global_step=2420, RunningAvgSamplesPerSec=103.4371816152632, CurrSamplesPerSec=101.66461622131781, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:27,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=22, lr=[1.334704707222787e-05, 1.334704707222787e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:27,732] [INFO] [timer.py:215:stop] epoch=0/micro_step=2430/global_step=2430, RunningAvgSamplesPerSec=103.42707496821346, CurrSamplesPerSec=101.08728132971717, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:30,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=22, lr=[1.3158653496371395e-05, 1.3158653496371395e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:30,890] [INFO] [timer.py:215:stop] epoch=0/micro_step=2440/global_step=2440, RunningAvgSamplesPerSec=103.41915515860043, CurrSamplesPerSec=101.87651798353644, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:34,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=22, lr=[1.2971122904149943e-05, 1.2971122904149943e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:34,055] [INFO] [timer.py:215:stop] epoch=0/micro_step=2450/global_step=2450, RunningAvgSamplesPerSec=103.4103441097284, CurrSamplesPerSec=101.08438829052123, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:37,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=22, lr=[1.2784468962576136e-05, 1.2784468962576136e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:37,219] [INFO] [timer.py:215:stop] epoch=0/micro_step=2460/global_step=2460, RunningAvgSamplesPerSec=103.40165864168407, CurrSamplesPerSec=101.607047660432, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
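In the timer records, CurrSamplesPerSec is the throughput of the latest step (samples processed divided by that step's wall time) and RunningAvgSamplesPerSec is its running mean, which is why the average drifts only slowly around 103.4 while the current value bounces mostly between about 100 and 105. A sketch of that bookkeeping; the class and method names are illustrative:

import time

class ThroughputMeter:
    # Illustrative bookkeeping matching the timer records:
    # CurrSamplesPerSec = samples / elapsed for the latest step,
    # RunningAvgSamplesPerSec = incremental mean of CurrSamplesPerSec.
    def __init__(self):
        self.count = 0
        self.avg = 0.0
        self._t0 = 0.0

    def start(self):
        self._t0 = time.time()

    def stop(self, samples):
        curr = samples / (time.time() - self._t0)
        self.count += 1
        self.avg += (curr - self.avg) / self.count  # incremental mean
        return curr, self.avg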
[2023-06-30 00:43:40,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=22, lr=[1.2598705274773297e-05, 1.2598705274773297e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:40,439] [INFO] [timer.py:215:stop] epoch=0/micro_step=2470/global_step=2470, RunningAvgSamplesPerSec=103.38556328014667, CurrSamplesPerSec=100.85385936727447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:43,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=22, lr=[1.2413845378984126e-05, 1.2413845378984126e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:43,623] [INFO] [timer.py:215:stop] epoch=0/micro_step=2480/global_step=2480, RunningAvgSamplesPerSec=103.37452513899954, CurrSamplesPerSec=102.10934298806117, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:45,818] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:43:45,818] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:43:46,785] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=22, lr=[1.2229902747583971e-05, 1.2229902747583971e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:46,801] [INFO] [timer.py:215:stop] epoch=0/micro_step=2490/global_step=2490, RunningAvgSamplesPerSec=103.36421151954558, CurrSamplesPerSec=101.79299465467761, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:48,355] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2494
[2023-06-30 00:43:48,355] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:43:48,355] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:43:49,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=23, lr=[1.2065149721440866e-05, 1.2065149721440866e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:49,949] [INFO] [timer.py:215:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=103.35813007887495, CurrSamplesPerSec=100.6233233847854, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:53,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=23, lr=[1.188298676856629e-05, 1.188298676856629e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:53,130] [INFO] [timer.py:215:stop] epoch=0/micro_step=2510/global_step=2510, RunningAvgSamplesPerSec=103.34773544112691, CurrSamplesPerSec=101.65175769179501, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:56,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=23, lr=[1.1701779768442154e-05, 1.1701779768442154e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:56,227] [INFO] [timer.py:215:stop] epoch=0/micro_step=2520/global_step=2520, RunningAvgSamplesPerSec=103.3484995051263, CurrSamplesPerSec=103.38636358384397, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:43:59,299] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=23, lr=[1.1521541927224994e-05, 1.1521541927224994e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:43:59,315] [INFO] [timer.py:215:stop] epoch=0/micro_step=2530/global_step=2530, RunningAvgSamplesPerSec=103.35045122027238, CurrSamplesPerSec=103.9222993062438, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:02,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=23, lr=[1.1342286380440201e-05, 1.1342286380440201e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:02,407] [INFO] [timer.py:215:stop] epoch=0/micro_step=2540/global_step=2540, RunningAvgSamplesPerSec=103.35178553910923, CurrSamplesPerSec=103.33876239304658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:05,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=23, lr=[1.1164026192024646e-05, 1.1164026192024646e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:05,501] [INFO] [timer.py:215:stop] epoch=0/micro_step=2550/global_step=2550, RunningAvgSamplesPerSec=103.3529094229318, CurrSamplesPerSec=104.07137329704499, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:08,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=23, lr=[1.0986774353374651e-05, 1.0986774353374651e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:08,605] [INFO] [timer.py:215:stop] epoch=0/micro_step=2560/global_step=2560, RunningAvgSamplesPerSec=103.35270884641778, CurrSamplesPerSec=102.5267191200061, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:11,695] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=23, lr=[1.0810543782399172e-05, 1.0810543782399172e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:11,711] [INFO] [timer.py:215:stop] epoch=0/micro_step=2570/global_step=2570, RunningAvgSamplesPerSec=103.35235751170696, CurrSamplesPerSec=104.29748988634474, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:14,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=23, lr=[1.063534732257834e-05, 1.063534732257834e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:14,821] [INFO] [timer.py:215:stop] epoch=0/micro_step=2580/global_step=2580, RunningAvgSamplesPerSec=103.35136284914803, CurrSamplesPerSec=102.88833337549511, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:17,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=23, lr=[1.0461197742027507e-05, 1.0461197742027507e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:17,919] [INFO] [timer.py:215:stop] epoch=0/micro_step=2590/global_step=2590, RunningAvgSamplesPerSec=103.35197207495254, CurrSamplesPerSec=103.43353038594415, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:19,742] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:44:19,743] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:44:20,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=23, lr=[1.0288107732566627e-05, 1.0288107732566627e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:21,009] [INFO] [timer.py:215:stop] epoch=0/micro_step=2600/global_step=2600, RunningAvgSamplesPerSec=103.35365176552274, CurrSamplesPerSec=104.13936382511864, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:21,278] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2600
[2023-06-30 00:44:21,278] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:44:21,278] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:44:24,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=24, lr=[1.0133243084910764e-05, 1.0133243084910764e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:24,071] [INFO] [timer.py:215:stop] epoch=0/micro_step=2610/global_step=2610, RunningAvgSamplesPerSec=103.35877472330053, CurrSamplesPerSec=103.00519639391291, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:27,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=24, lr=[9.962200949179345e-06, 9.962200949179345e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:27,163] [INFO] [timer.py:215:stop] epoch=0/micro_step=2620/global_step=2620, RunningAvgSamplesPerSec=103.36011624601979, CurrSamplesPerSec=104.25843977889566, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:30,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=24, lr=[9.792254750846891e-06, 9.792254750846891e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:30,251] [INFO] [timer.py:215:stop] epoch=0/micro_step=2630/global_step=2630, RunningAvgSamplesPerSec=103.3619358210195, CurrSamplesPerSec=104.03966938048762, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:33,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=24, lr=[9.623416875395763e-06, 9.623416875395763e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:33,341] [INFO] [timer.py:215:stop] epoch=0/micro_step=2640/global_step=2640, RunningAvgSamplesPerSec=103.36362144594474, CurrSamplesPerSec=103.901543454061, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:36,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=24, lr=[9.455699627535e-06, 9.455699627535e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:36,432] [INFO] [timer.py:215:stop] epoch=0/micro_step=2650/global_step=2650, RunningAvgSamplesPerSec=103.36504385565334, CurrSamplesPerSec=103.93131218019403, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:39,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=24, lr=[9.28911523030361e-06, 9.28911523030361e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:39,528] [INFO] [timer.py:215:stop] epoch=0/micro_step=2660/global_step=2660, RunningAvgSamplesPerSec=103.36578425014015, CurrSamplesPerSec=103.3772061032249, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:42,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=24, lr=[9.123675824179758e-06, 9.123675824179758e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:42,625] [INFO] [timer.py:215:stop] epoch=0/micro_step=2670/global_step=2670, RunningAvgSamplesPerSec=103.36649988951213, CurrSamplesPerSec=103.83418432078822, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:45,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=24, lr=[8.959393466195972e-06, 8.959393466195972e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:45,728] [INFO] [timer.py:215:stop] epoch=0/micro_step=2680/global_step=2680, RunningAvgSamplesPerSec=103.36642993638284, CurrSamplesPerSec=102.94025728597275, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:48,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=24, lr=[8.796280129060475e-06, 8.796280129060475e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:48,838] [INFO] [timer.py:215:stop] epoch=0/micro_step=2690/global_step=2690, RunningAvgSamplesPerSec=103.36535296418477, CurrSamplesPerSec=102.51003999807533, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
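Every step and timer record above has the same shape, so the interesting series (step, skipped, lr, throughput) can be recovered from the raw log with a regular expression; a sketch, assuming the log has been saved to a file named training.log (the filename is hypothetical):

import re

STEP_RE = re.compile(
    r"step=(?P<step>\d+), skipped=(?P<skipped>\d+), lr=\[(?P<lr>[0-9.e+-]+)"
)
TIMER_RE = re.compile(
    r"RunningAvgSamplesPerSec=(?P<avg>[0-9.]+), CurrSamplesPerSec=(?P<curr>[0-9.]+)"
)

def parse_log(path="training.log"):
    # Returns (step, skipped, lr) tuples and (running_avg, curr) throughput pairs.
    steps, throughput = [], []
    with open(path) as f:
        for line in f:
            m = STEP_RE.search(line)
            if m:
                steps.append((int(m["step"]), int(m["skipped"]), float(m["lr"])))
            t = TIMER_RE.search(line)
            if t:
                throughput.append((float(t["avg"]), float(t["curr"])))
    return steps, throughput

# e.g. parse_log()[0][:1] -> [(2060, 19, 2.0731209911246627e-05)]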
[2023-06-30 00:44:51,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=24, lr=[8.634347700284575e-06, 8.634347700284575e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:51,967] [INFO] [timer.py:215:stop] epoch=0/micro_step=2700/global_step=2700, RunningAvgSamplesPerSec=103.36211026526347, CurrSamplesPerSec=101.53594669386038, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:52,576] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:44:52,576] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:44:53,872] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2705
[2023-06-30 00:44:53,872] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:44:53,873] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:44:55,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=25, lr=[8.489627946722731e-06, 8.489627946722731e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:55,116] [INFO] [timer.py:215:stop] epoch=0/micro_step=2710/global_step=2710, RunningAvgSamplesPerSec=103.35636152155158, CurrSamplesPerSec=104.27569275519504, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:44:58,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=25, lr=[8.3299716849951e-06, 8.3299716849951e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:44:58,193] [INFO] [timer.py:215:stop] epoch=0/micro_step=2720/global_step=2720, RunningAvgSamplesPerSec=103.35937612791821, CurrSamplesPerSec=104.10204686300212, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:01,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=25, lr=[8.171530315647041e-06, 8.171530315647041e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:01,275] [INFO] [timer.py:215:stop] epoch=0/micro_step=2730/global_step=2730, RunningAvgSamplesPerSec=103.36196462083142, CurrSamplesPerSec=104.40279532287472, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:04,351] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=25, lr=[8.014315385702261e-06, 8.014315385702261e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:04,366] [INFO] [timer.py:215:stop] epoch=0/micro_step=2740/global_step=2740, RunningAvgSamplesPerSec=103.36324071754808, CurrSamplesPerSec=103.52823278588376, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:07,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=25, lr=[7.858338352803005e-06, 7.858338352803005e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:07,450] [INFO] [timer.py:215:stop] epoch=0/micro_step=2750/global_step=2750, RunningAvgSamplesPerSec=103.36554670314844, CurrSamplesPerSec=104.16328659280484, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:10,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=25, lr=[7.703610584374984e-06, 7.703610584374984e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:10,540] [INFO] [timer.py:215:stop] epoch=0/micro_step=2760/global_step=2760, RunningAvgSamplesPerSec=103.36704148764203, CurrSamplesPerSec=102.75969656971884, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:13,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=25, lr=[7.550143356798969e-06, 7.550143356798969e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:13,621] [INFO] [timer.py:215:stop] epoch=0/micro_step=2770/global_step=2770, RunningAvgSamplesPerSec=103.36952602540546, CurrSamplesPerSec=104.57158038832848, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:16,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=25, lr=[7.397947854588977e-06, 7.397947854588977e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:16,715] [INFO] [timer.py:215:stop] epoch=0/micro_step=2780/global_step=2780, RunningAvgSamplesPerSec=103.37060247032471, CurrSamplesPerSec=103.86712954436058, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:19,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=25, lr=[7.247035169577138e-06, 7.247035169577138e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:19,797] [INFO] [timer.py:215:stop] epoch=0/micro_step=2790/global_step=2790, RunningAvgSamplesPerSec=103.37293295800039, CurrSamplesPerSec=103.91007004874302, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:22,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=25, lr=[7.097416300105375e-06, 7.097416300105375e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:22,892] [INFO] [timer.py:215:stop] epoch=0/micro_step=2800/global_step=2800, RunningAvgSamplesPerSec=103.37380888189655, CurrSamplesPerSec=103.12834709861133, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:25,043] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2806
[2023-06-30 00:45:25,043] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-30 00:45:25,044] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:45:25,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=26, lr=[6.9638745440261084e-06, 6.9638745440261084e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:25,983] [INFO] [timer.py:215:stop] epoch=0/micro_step=2810/global_step=2810, RunningAvgSamplesPerSec=103.37520413426518, CurrSamplesPerSec=103.35626943930855, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:29,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=26, lr=[6.81674388616593e-06, 6.81674388616593e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:29,060] [INFO] [timer.py:215:stop] epoch=0/micro_step=2820/global_step=2820, RunningAvgSamplesPerSec=103.37812819233396, CurrSamplesPerSec=104.18277237052808, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:32,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=26, lr=[6.67093840297679e-06, 6.67093840297679e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:32,142] [INFO] [timer.py:215:stop] epoch=0/micro_step=2830/global_step=2830, RunningAvgSamplesPerSec=103.38042886032306, CurrSamplesPerSec=103.53294450611203, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:35,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=26, lr=[6.526468720593626e-06, 6.526468720593626e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:35,227] [INFO] [timer.py:215:stop] epoch=0/micro_step=2840/global_step=2840, RunningAvgSamplesPerSec=103.38233015864466, CurrSamplesPerSec=104.3739734091487, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:38,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=26, lr=[6.383345367799784e-06, 6.383345367799784e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:38,313] [INFO] [timer.py:215:stop] epoch=0/micro_step=2850/global_step=2850, RunningAvgSamplesPerSec=103.38425667719001, CurrSamplesPerSec=103.22042725459163, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:41,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=26, lr=[6.241578775259638e-06, 6.241578775259638e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:41,397] [INFO] [timer.py:215:stop] epoch=0/micro_step=2860/global_step=2860, RunningAvgSamplesPerSec=103.38628547914146, CurrSamplesPerSec=104.5504015155469, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:44,463] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=26, lr=[6.101179274758461e-06, 6.101179274758461e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:44,479] [INFO] [timer.py:215:stop] epoch=0/micro_step=2870/global_step=2870, RunningAvgSamplesPerSec=103.38852247227517, CurrSamplesPerSec=103.249806912977, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:45:47,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=26, lr=[5.962157098449431e-06, 5.962157098449431e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:45:47,566] [INFO] [timer.py:215:stop] epoch=0/micro_step=2880/global_step=2880, RunningAvgSamplesPerSec=103.39020687767702, CurrSamplesPerSec=104.55162314040317, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
RunningAvgSamplesPerSec=103.39020687767702, CurrSamplesPerSec=104.55162314040317, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:45:50,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=26, lr=[5.824522378107935e-06, 5.824522378107935e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:45:50,650] [INFO] [timer.py:215:stop] epoch=0/micro_step=2890/global_step=2890, RunningAvgSamplesPerSec=103.39217065086036, CurrSamplesPerSec=104.29043925962014, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:45:53,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=26, lr=[5.688285144393169e-06, 5.688285144393169e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:45:53,743] [INFO] [timer.py:215:stop] epoch=0/micro_step=2900/global_step=2900, RunningAvgSamplesPerSec=103.39313069827813, CurrSamplesPerSec=103.95618616388055, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:45:56,190] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,190] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,192] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,192] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,192] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,192] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-06-30 00:45:56,192] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-06-30 00:45:56,841] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=26, lr=[5.553455326117138e-06, 5.553455326117138e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:45:56,857] [INFO] [timer.py:215:stop] epoch=0/micro_step=2910/global_step=2910, RunningAvgSamplesPerSec=103.39177523395497, CurrSamplesPerSec=101.50446348893279, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:45:59,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=26, lr=[5.420042749521021e-06, 5.420042749521021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:45:59,957] [INFO] [timer.py:215:stop] epoch=0/micro_step=2920/global_step=2920, RunningAvgSamplesPerSec=103.39185552106042, CurrSamplesPerSec=104.43748390264476, MemAllocated=4.34GB, 
MaxMemAllocated=12.81GB [2023-06-30 00:46:03,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=26, lr=[5.2880571375590655e-06, 5.2880571375590655e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:46:03,046] [INFO] [timer.py:215:stop] epoch=0/micro_step=2930/global_step=2930, RunningAvgSamplesPerSec=103.39328425277789, CurrSamplesPerSec=102.69663808577938, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:46:06,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=26, lr=[5.157508109189993e-06, 5.157508109189993e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-30 00:46:06,123] [INFO] [timer.py:215:stop] epoch=0/micro_step=2940/global_step=2940, RunningAvgSamplesPerSec=103.39601599685938, CurrSamplesPerSec=103.86383408125536, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,223] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-06-30 00:46:08,224] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2946 [2023-06-30 00:46:08,224] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
[2023-06-30 00:46:09,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=27, lr=[5.041250128341901e-06, 5.041250128341901e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:09,146] [INFO] [timer.py:215:stop] epoch=0/micro_step=2950/global_step=2950, RunningAvgSamplesPerSec=103.40482728693426, CurrSamplesPerSec=104.44252256851092, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:12,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=27, lr=[4.9134567335009905e-06, 4.9134567335009905e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:12,228] [INFO] [timer.py:215:stop] epoch=0/micro_step=2960/global_step=2960, RunningAvgSamplesPerSec=103.40695783639015, CurrSamplesPerSec=104.27788015998559, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:15,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=27, lr=[4.787127222697066e-06, 4.787127222697066e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:15,307] [INFO] [timer.py:215:stop] epoch=0/micro_step=2970/global_step=2970, RunningAvgSamplesPerSec=103.40941793211337, CurrSamplesPerSec=104.14631322638667, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:18,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=27, lr=[4.662270802678737e-06, 4.662270802678737e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:18,385] [INFO] [timer.py:215:stop] epoch=0/micro_step=2980/global_step=2980, RunningAvgSamplesPerSec=103.4118896612421, CurrSamplesPerSec=104.61078393166123, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:21,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=27, lr=[4.538896572837459e-06, 4.538896572837459e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:21,461] [INFO] [timer.py:215:stop] epoch=0/micro_step=2990/global_step=2990, RunningAvgSamplesPerSec=103.4147220064043, CurrSamplesPerSec=104.22378908264119, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:24,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=27, lr=[4.417013524544378e-06, 4.417013524544378e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:24,551] [INFO] [timer.py:215:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=103.41581591844039, CurrSamplesPerSec=100.95247436281228, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:27,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=27, lr=[4.2966305404950695e-06, 4.2966305404950695e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:27,630] [INFO] [timer.py:215:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=103.41819386449988, CurrSamplesPerSec=104.48358142960123, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:30,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=27, lr=[4.177756394062146e-06, 4.177756394062146e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:30,754] [INFO] [timer.py:215:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=103.41560685052895, CurrSamplesPerSec=103.39775297518166, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:33,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=27, lr=[4.060399748655883e-06, 4.060399748655883e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:33,841] [INFO] [timer.py:215:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=103.41710928471934, CurrSamplesPerSec=104.18988932644932, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:36,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=27, lr=[3.944569157092839e-06, 3.944569157092839e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:36,924] [INFO] [timer.py:215:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=103.41895768349853, CurrSamplesPerSec=103.71134654718513, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:39,363] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:46:39,363] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:46:39,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=27, lr=[3.830273060972528e-06, 3.830273060972528e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:40,012] [INFO] [timer.py:215:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=103.42032215385785, CurrSamplesPerSec=103.85161857909206, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:43,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=27, lr=[3.7175197900622294e-06, 3.7175197900622294e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:43,128] [INFO] [timer.py:215:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=103.4187453300653, CurrSamplesPerSec=103.35412052794504, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:46,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=27, lr=[3.606317561689887e-06, 3.606317561689887e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:46,242] [INFO] [timer.py:215:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=103.41730771064866, CurrSamplesPerSec=102.8523015335383, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
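The lr values in the step entries decay along a cosine curve, consistent with a 5e-5 peak over about 3,680 scheduler updates with the scheduler advancing only on non-skipped steps; all of these figures are back-solved from the entries themselves, so treat the sketch below as an approximation rather than the run's exact bookkeeping:

import math

def cosine_lr(update, peak_lr=5e-5, total_updates=3680):
    # Plain cosine decay without warmup; peak_lr and total_updates are
    # assumptions inferred from the lr values in the log above.
    progress = min(update / total_updates, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# step=2890 with skipped=26 leaves 2864 applied updates:
print(f"{cosine_lr(2890 - 26):.2e}")  # ~5.8e-06, close to the step=2890 entry above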
[2023-06-30 00:46:49,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=27, lr=[3.4966744801452624e-06, 3.4966744801452624e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:49,347] [INFO] [timer.py:215:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=103.41673758370753, CurrSamplesPerSec=103.34799264185564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:52,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=27, lr=[3.3885985360893046e-06, 3.3885985360893046e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:52,468] [INFO] [timer.py:215:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=103.41451801260327, CurrSamplesPerSec=103.61046991408878, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:53,668] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3093
[2023-06-30 00:46:53,668] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-30 00:46:53,669] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-06-30 00:46:55,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=28, lr=[3.292676601246661e-06, 3.292676601246661e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:55,532] [INFO] [timer.py:215:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=103.41852781076601, CurrSamplesPerSec=103.46582301509847, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:46:58,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=28, lr=[3.187599823180071e-06, 3.187599823180071e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:46:58,667] [INFO] [timer.py:215:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=103.41482774438373, CurrSamplesPerSec=102.6502978148011, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:01,787] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=28, lr=[3.084112707605613e-06, 3.084112707605613e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:01,803] [INFO] [timer.py:215:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=103.41105390656718, CurrSamplesPerSec=102.83197084922473, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:04,914] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=28, lr=[2.982222796544551e-06, 2.982222796544551e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:04,929] [INFO] [timer.py:215:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=103.40839181428267, CurrSamplesPerSec=102.18288320181499, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:08,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=28, lr=[2.881937515615732e-06, 2.881937515615732e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:08,041] [INFO] [timer.py:215:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=103.40722576619446, CurrSamplesPerSec=103.10339243711293, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:11,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=28, lr=[2.7832641734944238e-06, 2.7832641734944238e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:11,146] [INFO] [timer.py:215:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=103.40675941544023, CurrSamplesPerSec=103.28444874012695, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:14,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=28, lr=[2.686209961379646e-06, 2.686209961379646e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:14,250] [INFO] [timer.py:215:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=103.40651418905415, CurrSamplesPerSec=102.25489341602035, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:17,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=28, lr=[2.5907819524701173e-06, 2.5907819524701173e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:17,361] [INFO] [timer.py:215:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=103.40549973995546, CurrSamplesPerSec=102.51677363818497, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:20,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=28, lr=[2.4969871014487277e-06, 2.4969871014487277e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:20,472] [INFO] [timer.py:215:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=103.40442975244424, CurrSamplesPerSec=102.86893454276506, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:23,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=28, lr=[2.404832243975716e-06, 2.404832243975716e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:23,593] [INFO] [timer.py:215:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=103.40231771541507, CurrSamplesPerSec=103.16132161251669, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:25,109] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:47:25,109] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:47:26,674] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=28, lr=[2.314324096190493e-06, 2.314324096190493e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:26,690] [INFO] [timer.py:215:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=103.40265848315846, CurrSamplesPerSec=104.30308244061843, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:29,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=28, lr=[2.225469254222162e-06, 2.225469254222162e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:29,783] [INFO] [timer.py:215:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=103.40350374078245, CurrSamplesPerSec=103.14602071869909, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:32,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=28, lr=[2.138274193708828e-06, 2.138274193708828e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:32,896] [INFO] [timer.py:215:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=103.40227548030245, CurrSamplesPerSec=103.62382878848707, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:35,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=28, lr=[2.0527452693256287e-06, 2.0527452693256287e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:36,004] [INFO] [timer.py:215:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=103.40154715864838, CurrSamplesPerSec=103.65079700950574, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:39,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=28, lr=[1.9688887143216263e-06, 1.9688887143216263e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:39,112] [INFO] [timer.py:215:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=103.40085682837653, CurrSamplesPerSec=101.63697825165082, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:42,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=28, lr=[1.8867106400655533e-06, 1.8867106400655533e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:42,211] [INFO] [timer.py:215:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=103.40103111723874, CurrSamplesPerSec=103.44301674599848, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:45,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=28, lr=[1.8062170356003855e-06, 1.8062170356003855e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:45,331] [INFO] [timer.py:215:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=103.39908917962244, CurrSamplesPerSec=99.06895454723275, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:48,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=28, lr=[1.7274137672069145e-06, 1.7274137672069145e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:48,459] [INFO] [timer.py:215:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=103.39634891877847, CurrSamplesPerSec=102.76662042013899, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:51,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=28, lr=[1.6503065779761796e-06, 1.6503065779761796e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:51,577] [INFO] [timer.py:215:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=103.39460245144883, CurrSamplesPerSec=103.45083034917366, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:54,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=28, lr=[1.5749010873909175e-06, 1.5749010873909175e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:54,679] [INFO] [timer.py:215:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=103.3946293293107, CurrSamplesPerSec=103.82012869829694, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:47:56,204] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:47:56,205] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
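The throughput figures are internally consistent: consecutive timer entries are roughly 0.31 s apart per step and report about 103 samples/s, implying a global batch of about 32 samples per step. A small meter along these lines (the names and the batch size are illustrative inferences, not DeepSpeed's):

class ThroughputMeter:
    """Tracks the current and running-average samples/sec seen in timer lines."""

    def __init__(self, global_batch_size=32):  # back-solved from the log above
        self.batch = global_batch_size
        self.samples = 0
        self.seconds = 0.0

    def step(self, elapsed_sec):
        self.samples += self.batch
        self.seconds += elapsed_sec
        curr = self.batch / elapsed_sec    # CurrSamplesPerSec
        avg = self.samples / self.seconds  # RunningAvgSamplesPerSec
        return curr, avg

meter = ThroughputMeter()
print(meter.step(0.31))  # ~(103.2, 103.2), in line with the entries above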
[2023-06-30 00:47:56,828] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3296
[2023-06-30 00:47:56,828] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:47:56,828] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:47:57,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=29, lr=[1.5084956427768333e-06, 1.5084956427768333e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:47:57,773] [INFO] [timer.py:215:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=103.39533679374672, CurrSamplesPerSec=101.3761222032723, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:00,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=29, lr=[1.4363384168002398e-06, 1.4363384168002398e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:00,947] [INFO] [timer.py:215:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=103.38799473019206, CurrSamplesPerSec=100.32255045737153, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:04,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=29, lr=[1.3658984832211923e-06, 1.3658984832211923e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:04,143] [INFO] [timer.py:215:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=103.37845141279838, CurrSamplesPerSec=100.3367250441625, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:07,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=29, lr=[1.2971809756205738e-06, 1.2971809756205738e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:07,308] [INFO] [timer.py:215:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=103.37209865087493, CurrSamplesPerSec=101.81531078773553, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:10,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=29, lr=[1.2301909020508346e-06, 1.2301909020508346e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:10,470] [INFO] [timer.py:215:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=103.36608588314704, CurrSamplesPerSec=101.40384830416532, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:13,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=29, lr=[1.1649331446710487e-06, 1.1649331446710487e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:13,656] [INFO] [timer.py:215:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=103.35776937514916, CurrSamplesPerSec=100.60424131035083, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:16,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=29, lr=[1.1014124593910825e-06, 1.1014124593910825e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:16,821] [INFO] [timer.py:215:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=103.35154164890108, CurrSamplesPerSec=101.65653113747874, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
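Entries in this format are regular enough to scrape for plotting; the helper below is illustrative tooling, not part of the training scripts:

import re

STEP_RE = re.compile(r"step=(\d+), skipped=(\d+), lr=\[([0-9eE.+-]+),")

def parse_steps(log_text):
    """Yield (step, skipped, lr) from '[Rank 0] step=...' entries."""
    for m in STEP_RE.finditer(log_text):
        yield int(m.group(1)), int(m.group(2)), float(m.group(3))

sample = "[Rank 0] step=3360, skipped=29, lr=[1.1014124593910825e-06, 1.1014124593910825e-06]"
print(list(parse_steps(sample)))  # [(3360, 29, 1.1014124593910825e-06)]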
[2023-06-30 00:48:19,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=29, lr=[1.0396334755249954e-06, 1.0396334755249954e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:19,987] [INFO] [timer.py:215:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=103.3452334920379, CurrSamplesPerSec=101.67516726461747, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:23,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=29, lr=[9.796006954536723e-07, 9.796006954536723e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:23,154] [INFO] [timer.py:215:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=103.33885321574286, CurrSamplesPerSec=101.6106630166356, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:26,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=29, lr=[9.213184942966662e-07, 9.213184942966662e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:26,314] [INFO] [timer.py:215:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=103.33322371545579, CurrSamplesPerSec=101.84003211103499, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:28,810] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:48:28,810] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:48:29,458] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=29, lr=[8.647911195933861e-07, 8.647911195933861e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:29,474] [INFO] [timer.py:215:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=103.3276956587676, CurrSamplesPerSec=101.93446980571213, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:30,694] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3403
[2023-06-30 00:48:30,694] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:48:30,696] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:48:32,572] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=30, lr=[8.154202665162147e-07, 8.154202665162147e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:32,588] [INFO] [timer.py:215:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=103.32661667595924, CurrSamplesPerSec=101.06231528632925, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:35,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=30, lr=[7.622383057669186e-07, 7.622383057669186e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:35,767] [INFO] [timer.py:215:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=103.3192286305815, CurrSamplesPerSec=99.70066119103525, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:38,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=30, lr=[7.108227650514637e-07, 7.108227650514637e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:38,921] [INFO] [timer.py:215:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=103.31436602717464, CurrSamplesPerSec=101.59120466561455, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:41,393] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3437
[2023-06-30 00:48:41,393] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-06-30 00:48:41,393] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
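A "Grad overflow on iteration N" fires when non-finite values appear in the scaled gradients; DeepSpeed also combines the flag across ranks, which is why all eight processes report the same iteration in lockstep. A local, single-process sketch of the check:

import torch

def grads_have_overflow(parameters) -> bool:
    """True if any gradient contains inf/NaN, i.e. the step must be skipped."""
    for p in parameters:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return True
    return False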
[2023-06-30 00:48:42,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=31, lr=[6.660621671832845e-07, 6.660621671832845e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:42,018] [INFO] [timer.py:215:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=103.3149634731505, CurrSamplesPerSec=101.60466320459054, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:45,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=31, lr=[6.180130411873486e-07, 6.180130411873486e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:45,184] [INFO] [timer.py:215:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=103.3088782091793, CurrSamplesPerSec=101.24168977367702, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:48,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=31, lr=[5.717408461956952e-07, 5.717408461956952e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:48,365] [INFO] [timer.py:215:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=103.30146197745825, CurrSamplesPerSec=101.47461088265959, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:51,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=31, lr=[5.272489544723619e-07, 5.272489544723619e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:51,517] [INFO] [timer.py:215:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=103.2968206753969, CurrSamplesPerSec=103.10236282168182, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:54,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=31, lr=[4.84540608534953e-07, 4.84540608534953e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:54,601] [INFO] [timer.py:215:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=103.29876910160802, CurrSamplesPerSec=103.78946049755137, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:48:57,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=31, lr=[4.4361892091831225e-07, 4.4361892091831225e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:48:57,682] [INFO] [timer.py:215:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=103.30091419609958, CurrSamplesPerSec=103.77164593956216, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:00,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=31, lr=[4.044868739476959e-07, 4.044868739476959e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:00,762] [INFO] [timer.py:215:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=103.30323104450524, CurrSamplesPerSec=104.23763040603038, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:03,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=31, lr=[3.671473195214159e-07, 3.671473195214159e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:03,841] [INFO] [timer.py:215:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=103.30556441167293, CurrSamplesPerSec=104.24410714133153, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:06,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=31, lr=[3.3160297890300894e-07, 3.3160297890300894e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:06,948] [INFO] [timer.py:215:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=103.30532431015848, CurrSamplesPerSec=102.10491529910811, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:10,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=31, lr=[2.97856442522898e-07, 2.97856442522898e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:10,057] [INFO] [timer.py:215:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=103.30480878908143, CurrSamplesPerSec=101.44301235370365, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:12,822] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:49:12,822] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-06-30 00:49:13,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=31, lr=[2.6591016978961826e-07, 2.6591016978961826e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:13,168] [INFO] [timer.py:215:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=103.30419523682326, CurrSamplesPerSec=102.50440321219504, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:16,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=31, lr=[2.3576648891056875e-07, 2.3576648891056875e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:16,271] [INFO] [timer.py:215:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=103.30424086424571, CurrSamplesPerSec=103.05217421348638, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:19,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=31, lr=[2.074275967223427e-07, 2.074275967223427e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:19,380] [INFO] [timer.py:215:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=103.3037398072294, CurrSamplesPerSec=103.3674929839532, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:22,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=31, lr=[1.8089555853061934e-07, 1.8089555853061934e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:22,501] [INFO] [timer.py:215:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=103.30220868824269, CurrSamplesPerSec=103.2340827455716, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:25,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=31, lr=[1.561723079596511e-07, 1.561723079596511e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:25,605] [INFO] [timer.py:215:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=103.30222350710812, CurrSamplesPerSec=103.16092515902515, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:28,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=31, lr=[1.3325964681134007e-07, 1.3325964681134007e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:28,706] [INFO] [timer.py:215:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=103.30258204693664, CurrSamplesPerSec=103.18789093973147, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:31,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=31, lr=[1.1215924493392094e-07, 1.1215924493392094e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:31,806] [INFO] [timer.py:215:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=103.30293464068404, CurrSamplesPerSec=103.46829562738449, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:34,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=31, lr=[9.287264010027775e-08, 9.287264010027775e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:34,916] [INFO] [timer.py:215:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=103.30239186724934, CurrSamplesPerSec=103.6422328785106, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:38,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=31, lr=[7.540123789585574e-08, 7.540123789585574e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:38,038] [INFO] [timer.py:215:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=103.3007186845449, CurrSamplesPerSec=102.44666187831218, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:41,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=31, lr=[5.974631161624056e-08, 5.974631161624056e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:41,142] [INFO] [timer.py:215:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=103.30080928495603, CurrSamplesPerSec=103.22868363126231, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:43,919] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-06-30 00:49:43,920] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-06-30 00:49:44,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=31, lr=[4.5909002174351904e-08, 4.5909002174351904e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:44,264] [INFO] [timer.py:215:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=103.29915289199997, CurrSamplesPerSec=102.84560255195048, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:44,532] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3640
[2023-06-30 00:49:44,532] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-06-30 00:49:44,533] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-06-30 00:49:47,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=32, lr=[3.5010322292722275e-08, 3.5010322292722275e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:47,324] [INFO] [timer.py:215:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=103.30320925772118, CurrSamplesPerSec=103.62262874877919, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:50,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=32, lr=[2.462915357190343e-08, 2.462915357190343e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:50,422] [INFO] [timer.py:215:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=103.3037569797642, CurrSamplesPerSec=103.9863084700459, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:53,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=32, lr=[1.6068160986662527e-08, 1.6068160986662527e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:53,514] [INFO] [timer.py:215:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=103.30486770859807, CurrSamplesPerSec=103.0969774760554, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-06-30 00:49:56,612] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=32, lr=[9.327968452232938e-09, 9.327968452232938e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-30 00:49:56,628] [INFO] [timer.py:215:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=103.30396609503609, CurrSamplesPerSec=102.70221744742946, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
Epoch 1/1 with loss 0.585697713105575
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : 2.3001439571380615, acc (higher is better) : 0.6574999690055847
saving model ...
[2023-06-30 00:50:17,927] [INFO] [launch.py:347:main] Process 1144371 exits successfully.
[2023-06-30 00:50:17,927] [INFO] [launch.py:347:main] Process 1144370 exits successfully.
[2023-06-30 00:50:17,927] [INFO] [launch.py:347:main] Process 1144372 exits successfully.
[2023-06-30 00:50:18,929] [INFO] [launch.py:347:main] Process 1144368 exits successfully.
[2023-06-30 00:50:19,930] [INFO] [launch.py:347:main] Process 1144369 exits successfully.
[2023-06-30 00:50:19,930] [INFO] [launch.py:347:main] Process 1144365 exits successfully.
[2023-06-30 00:50:20,932] [INFO] [launch.py:347:main] Process 1144367 exits successfully.
[2023-06-30 00:50:21,933] [INFO] [launch.py:347:main] Process 1144366 exits successfully.
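The closing evaluation summarizes the trained reward model: chosen_last_scores is the mean score given to preferred responses, and acc is the fraction of pairs where the chosen response outscores the rejected one. A sketch of the standard pairwise ranking objective such a reward model typically optimizes (assumed here; the log itself does not show the loss function):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    """-log sigmoid(chosen - rejected) loss plus the accuracy metric above."""
    loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
    acc = (chosen_scores > rejected_scores).float().mean()
    return loss, acc

chosen = torch.tensor([2.3, 1.1, 0.4])
rejected = torch.tensor([1.9, 1.5, -0.2])
loss, acc = pairwise_reward_loss(chosen, rejected)
print(f"loss={loss.item():.3f} acc={acc.item():.3f}")  # acc=0.667 on this toy batch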