[2023-05-15 22:48:04,301] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-15 22:48:04,364] [INFO] [runner.py:541:main] cmd = /home/users/liyurui/environment/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_split 2,4,4 --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 2 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir ./output
[2023-05-15 22:48:06,326] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-15 22:48:06,326] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-15 22:48:06,326] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-15 22:48:06,326] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-15 22:48:06,326] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-15 22:48:10,802] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 3.307976007461548 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 3.1059069633483887 seconds
Time to load fused_adam op: 2.6053833961486816 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 2.1047401428222656 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 3.205968141555786 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.9032485485076904 seconds
[2023-05-15 22:49:53,509] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-15 22:49:53,509] [INFO] [comm.py:616:init_distributed] Distributed backend already initialized
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
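An aside on the opaque --world_info argument in the launcher command above: it is base64-encoded JSON, and decoding it reproduces the WORLD INFO DICT that the launcher prints a few lines later. A quick check in Python:

    import base64
    import json

    # --world_info value copied verbatim from the launcher command above
    world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
    print(json.loads(base64.b64decode(world_info)))
    # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}  -- matches WORLD INFO DICT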
Loading extension module fused_adam...
Time to load fused_adam op: 3.0536563396453857 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.9490580558776855 seconds
[2023-05-15 22:50:06,701] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-05-15 22:50:06,702] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-15 22:50:06,702] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-15 22:50:06,721] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-15 22:50:06,721] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
[2023-05-15 22:50:06,907] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-15 22:50:06,908] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-15 22:50:06,908] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2023-05-15 22:50:06,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:06,908] [INFO] [config.py:955:print] DeepSpeedEngine configuration:
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] amp_enabled .................. False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] amp_params ................... False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] bfloat16_enabled ............. False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] checkpoint_parallel_write_pipeline False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] checkpoint_tag_validation_enabled True
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] checkpoint_tag_validation_fail False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] comms_config .................
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] communication_data_type ...... None
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] curriculum_enabled_legacy .... False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] curriculum_params_legacy ..... False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] data_efficiency_enabled ...... False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] dataloader_drop_last ......... False
[2023-05-15 22:50:06,909] [INFO] [config.py:959:print] disable_allgather ............ False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] dump_state ................... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_enabled ........... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_gas_boundary_resolution 1
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_layer_num ......... 0
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_max_iter .......... 100
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_stability ......... 1e-06
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_tol ............... 0.01
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] eigenvalue_verbose ........... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] elasticity_enabled ........... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] fp16_auto_cast ............... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] fp16_enabled ................. True
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] fp16_master_weights_and_gradients False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] global_rank .................. 0
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] grad_accum_dtype ............. None
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] gradient_accumulation_steps .. 1
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] gradient_clipping ............ 1.0
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] gradient_predivide_factor .... 1.0
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] initial_dynamic_scale ........ 65536
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] load_universal_checkpoint .... False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] loss_scale ................... 0
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] memory_breakdown ............. False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] mics_hierarchial_params_gather False
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] mics_shard_size .............. -1
[2023-05-15 22:50:06,910] [INFO] [config.py:959:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] optimizer_legacy_fusion ...... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] optimizer_name ............... None
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] optimizer_params ............. None
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] pld_enabled .................. False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] pld_params ................... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] prescale_gradients ........... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] scheduler_name ............... None
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] scheduler_params ............. None
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] sparse_attention ............. None
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] sparse_gradients_enabled ..... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] steps_per_print .............. 10
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] train_batch_size ............. 32
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] train_micro_batch_size_per_gpu 4
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] use_node_local_storage ....... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] wall_clock_breakdown ......... False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] world_size ................... 8
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] zero_allow_untested_optimizer False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] zero_enabled ................. False
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] zero_force_ds_cpu_optimizer .. True
[2023-05-15 22:50:06,911] [INFO] [config.py:959:print] zero_optimization_stage ...... 0
[2023-05-15 22:50:06,911] [INFO] [config.py:945:print_user_config] json = {
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 0,
    "offload_param": { "device": "none" },
    "offload_optimizer": { "device": "none" },
    "stage3_param_persistence_threshold": 1.000000e+04,
    "stage3_max_live_parameters": 3.000000e+07,
    "stage3_prefetch_bucket_size": 3.000000e+07,
    "memory_efficient_linear": false
  },
  "fp16": { "enabled": true, "loss_scale_window": 100 },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "wall_clock_breakdown": false,
  "hybrid_engine": {
    "enabled": false,
    "max_out_tokens": 512,
    "inference_tp_size": 1,
    "release_inference_cache": false,
    "pin_parameters": true,
    "tp_gather_partition_size": 8
  }
}
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 3.048469066619873 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 3.105386257171631 seconds
Time to load utils op: 3.1055259704589844 seconds
Loading extension module utils...
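As a sanity check on the configuration dump above: the printed train_batch_size of 32 follows directly from the launch flags, since the global batch is micro-batch per GPU x gradient-accumulation steps x world size (all values below are taken from this log):

    # Effective global batch size implied by the flags in this log
    world_size = 8                    # one node, 8 local GPUs
    micro_batch_per_gpu = 4           # --per_device_train_batch_size 4
    gradient_accumulation_steps = 1   # --gradient_accumulation_steps 1

    train_batch_size = world_size * micro_batch_per_gpu * gradient_accumulation_steps
    assert train_batch_size == 32     # matches "train_batch_size ............. 32"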
Loading extension module utils...
Time to load utils op: 3.1060147285461426 seconds
Time to load utils op: 3.1066737174987793 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 3.1064646244049072 seconds
Loading extension module utils...
Time to load utils op: 3.1047708988189697 seconds
***** Running training *****
***** Evaluating reward, Epoch 0/2 *****
Time to load utils op: 3.1067636013031006 seconds
chosen_last_scores (higher is better) : 2.576474905014038, acc (higher is better) : 0.4899999797344208
Beginning of Epoch 1/2, Total Micro Batches 3680
[2023-05-15 22:50:15,587] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 0
[2023-05-15 22:50:15,587] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0
[2023-05-15 22:50:15,587] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[2023-05-15 22:50:15,712] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1
[2023-05-15 22:50:15,712] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:50:15,712] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:50:15,834] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2
[2023-05-15 22:50:15,834] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:50:15,834] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:50:15,959] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3
[2023-05-15 22:50:15,959] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:50:15,961] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 22:50:16,085] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4
[2023-05-15 22:50:16,085] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[2023-05-15 22:50:16,085] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-05-15 22:50:16,207] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5
[2023-05-15 22:50:16,208] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
[2023-05-15 22:50:16,208] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
[2023-05-15 22:50:16,329] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6
[2023-05-15 22:50:16,329] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0
[2023-05-15 22:50:16,330] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
[2023-05-15 22:50:16,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=7, lr=[4.999997950270367e-05, 4.999997950270367e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:16,829] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=230.22027214548274, CurrSamplesPerSec=192.82327825234964, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:18,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=7, lr=[4.999961510725946e-05, 4.999961510725946e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:18,489] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=208.25657241304523, CurrSamplesPerSec=191.49668420632113, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:20,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=7, lr=[4.9998795223983205e-05, 4.9998795223983205e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:20,147] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=202.8314273353189, CurrSamplesPerSec=199.1827876564536, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:21,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=7, lr=[4.999751986781301e-05, 4.999751986781301e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:21,776] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=201.30693543726474, CurrSamplesPerSec=197.826157063834, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:23,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[4.9995789061985624e-05, 4.9995789061985624e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:23,403] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=200.47277843663835, CurrSamplesPerSec=196.6856946909204, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:25,024] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=7, lr=[4.999360283803594e-05, 4.999360283803594e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:25,030] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=199.93851369782917, CurrSamplesPerSec=198.89500116327835, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:26,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=7, lr=[4.9990961235796527e-05, 4.9990961235796527e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:26,651] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=199.66188897910482, CurrSamplesPerSec=198.70890030187329, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:28,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=7, lr=[4.998786430339683e-05, 4.998786430339683e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:28,285] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=199.26954399367315, CurrSamplesPerSec=195.30760231195032, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:29,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=7, lr=[4.998431209726232e-05, 4.998431209726232e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:29,935] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=198.73876151404275, CurrSamplesPerSec=205.90372894252792, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:31,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=7, lr=[4.9980304682113455e-05, 4.9980304682113455e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:31,492] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=199.48873384685842, CurrSamplesPerSec=206.02667861934307, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:32,719] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:50:32,720] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0
[2023-05-15 22:50:33,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=7, lr=[4.997584213096451e-05, 4.997584213096451e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:33,050] [INFO] [timer.py:199:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=200.09537459002092, CurrSamplesPerSec=206.17384211504577, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:34,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=7, lr=[4.997092452512226e-05, 4.997092452512226e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:34,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=200.6056577671834, CurrSamplesPerSec=206.22326020186313, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:36,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=7, lr=[4.996555195418446e-05, 4.996555195418446e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:36,192] [INFO] [timer.py:199:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=200.77198918197692, CurrSamplesPerSec=204.10766093405698, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:37,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=7, lr=[4.995972451603824e-05, 4.995972451603824e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:37,761] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=201.0521165246602, CurrSamplesPerSec=204.10766093405698, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:39,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=7, lr=[4.995344231685833e-05, 4.995344231685833e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:39,340] [INFO] [timer.py:199:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=201.2187550623208, CurrSamplesPerSec=204.83500598246772, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:40,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=7, lr=[4.994670547110511e-05, 4.994670547110511e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:40,941] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=201.18131807890558, CurrSamplesPerSec=197.80866381832303, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:42,540] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=7, lr=[4.99395141015225e-05, 4.99395141015225e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:42,546] [INFO] [timer.py:199:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=201.11974752252124, CurrSamplesPerSec=201.2513240048222, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:44,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=7, lr=[4.993186833913579e-05, 4.993186833913579e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:44,157] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=201.01777054480732, CurrSamplesPerSec=198.86258513883723, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:45,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=7, lr=[4.992376832324919e-05, 4.992376832324919e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:45,803] [INFO] [timer.py:199:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=200.70340685414658, CurrSamplesPerSec=195.82993934742674, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:47,436] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=7, lr=[4.99152142014433e-05, 4.99152142014433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:47,442] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=200.45726200622573, CurrSamplesPerSec=195.9643222676297, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:48,737] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:50:48,738] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
[2023-05-15 22:50:49,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=7, lr=[4.990620612957248e-05, 4.990620612957248e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:49,085] [INFO] [timer.py:199:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=200.2188537577519, CurrSamplesPerSec=196.48327916849655, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:50,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=7, lr=[4.989674427176193e-05, 4.989674427176193e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:50,726] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=200.00926446737535, CurrSamplesPerSec=196.26291077066153, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:52,357] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=7, lr=[4.988682880040475e-05, 4.988682880040475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:52,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=199.83872955152168, CurrSamplesPerSec=196.46889848495937, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:53,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=7, lr=[4.987645989615879e-05, 4.987645989615879e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:53,999] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=199.68925131078848, CurrSamplesPerSec=196.0215624447758, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:55,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=7, lr=[4.986563774794334e-05, 4.986563774794334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:55,653] [INFO] [timer.py:199:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=199.4630158790758, CurrSamplesPerSec=196.33066376451623, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:50:57,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=7, lr=[4.9854362552935706e-05, 4.9854362552935706e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:50:57,280] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=199.38141072821605, CurrSamplesPerSec=200.00346905049645, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
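The overflow and rescaling chatter above is fp16 dynamic loss scaling at work: the scale starts at init_scale=65536, every gradient overflow halves it and skips that optimizer step (hence skipped=7 in the step lines), and 100 consecutive overflow-free steps (scale_window=100) double it again. A minimal sketch of that policy, assuming the simple halve/double rule these messages suggest and ignoring the delayed_shift=2 refinement listed in dynamic_loss_scale_args (this is not DeepSpeed's actual fused_optimizer code):

    class DynamicLossScaler:
        """Halve the loss scale on overflow; double it after a quiet window."""

        def __init__(self, init_scale=65536.0, scale_window=100, min_scale=1.0):
            self.scale = init_scale
            self.scale_window = scale_window
            self.min_scale = min_scale
            self.clean_steps = 0  # optimizer steps since the last overflow

        def update(self, overflow: bool) -> bool:
            """Return True if the optimizer step should be skipped."""
            if overflow:
                # "Overflow detected. Skipping step. Attempted loss scale: X, reducing to X/2"
                self.scale = max(self.scale / 2, self.min_scale)
                self.clean_steps = 0
                return True
            self.clean_steps += 1
            if self.clean_steps % self.scale_window == 0:
                # "No Grad overflow for 100 iterations" -> double the scale
                self.scale *= 2
            return False

Replaying the first seven iterations of this log (all overflows) takes the scale from 65536 down to 512, exactly as printed above, after which it climbs back one doubling per 100 clean steps.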
[2023-05-15 22:50:58,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=7, lr=[4.984263451656762e-05, 4.984263451656762e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:50:58,897] [INFO] [timer.py:199:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=199.3534234287843, CurrSamplesPerSec=197.80924687666263, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:00,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=7, lr=[4.983045385252149e-05, 4.983045385252149e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:00,519] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=199.30944030228625, CurrSamplesPerSec=199.0336251217845, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:02,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=7, lr=[4.98178207827265e-05, 4.98178207827265e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:02,135] [INFO] [timer.py:199:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=199.2914217729972, CurrSamplesPerSec=199.72103460282787, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:03,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=7, lr=[4.9804735537354575e-05, 4.9804735537354575e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:03,748] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=199.28442022259134, CurrSamplesPerSec=200.4057276813044, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,019] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,020] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,020] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,020] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,020] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,022] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:05,023] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0 [2023-05-15 22:51:05,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=7, lr=[4.9791198354816186e-05, 4.9791198354816186e-05], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:05,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=199.2738641613571, CurrSamplesPerSec=197.17342946839253, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:06,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=7, lr=[4.977720948175601e-05, 4.977720948175601e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:06,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=199.37840634238515, CurrSamplesPerSec=206.64490869256622, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:08,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=7, lr=[4.976276917304844e-05, 4.976276917304844e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:08,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=199.49621543978037, CurrSamplesPerSec=200.2884967214925, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:10,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=7, lr=[4.97478776917929e-05, 4.97478776917929e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:10,139] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=199.49249342828898, CurrSamplesPerSec=199.6931024221865, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:11,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=7, lr=[4.973253530930912e-05, 4.973253530930912e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:11,740] [INFO] [timer.py:199:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=199.52555053740423, CurrSamplesPerSec=201.12315592777904, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:13,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=7, lr=[4.9716742305132146e-05, 4.9716742305132146e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:13,336] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=199.57299233718192, CurrSamplesPerSec=207.08266166360147, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:14,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=7, lr=[4.970049896700726e-05, 4.970049896700726e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:14,891] [INFO] [timer.py:199:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=199.75700540119473, CurrSamplesPerSec=205.70744050675205, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:16,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=7, lr=[4.968380559088472e-05, 4.968380559088472e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:16,446] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=199.93609989699414, CurrSamplesPerSec=207.3956097274545, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:18,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=7, lr=[4.9666662480914414e-05, 4.9666662480914414e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:18,020] [INFO] [timer.py:199:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=200.03830125928516, CurrSamplesPerSec=206.4589551664836, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:19,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=7, lr=[4.964906994944026e-05, 4.964906994944026e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 
22:51:19,608] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=200.09611426828963, CurrSamplesPerSec=206.59274434984985, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,840] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:51:20,841] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:51:21,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=7, lr=[4.963102831699457e-05, 4.963102831699457e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:21,170] [INFO] [timer.py:199:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=200.22999488986164, CurrSamplesPerSec=206.62391255821115, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:22,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=7, lr=[4.9612537912292134e-05, 4.9612537912292134e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:22,722] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=200.38474773146098, CurrSamplesPerSec=206.51867808575395, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:24,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=7, lr=[4.959359907222434e-05, 4.959359907222434e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:24,275] [INFO] [timer.py:199:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=200.53201598116783, CurrSamplesPerSec=206.88890361822786, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:51:25,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=7, lr=[4.9574212141852936e-05, 4.9574212141852936e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:51:25,838] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=200.64129615595752, 
CurrSamplesPerSec=206.76556005730748, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:27,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=7, lr=[4.95543774744038e-05, 4.95543774744038e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:27,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=200.77409485453848, CurrSamplesPerSec=206.50247324053746, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:28,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=7, lr=[4.9534095431260486e-05, 4.9534095431260486e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:28,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=200.89449382320797, CurrSamplesPerSec=203.49832613652904, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:30,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=7, lr=[4.9513366381957635e-05, 4.9513366381957635e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:30,503] [INFO] [timer.py:199:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=201.01216837965868, CurrSamplesPerSec=206.73084642173376, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:32,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=7, lr=[4.949219070417425e-05, 4.949219070417425e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:32,057] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=201.1295361784753, CurrSamplesPerSec=206.43005913701484, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:33,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=7, lr=[4.947056878372681e-05, 4.947056878372681e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:33,612] [INFO] [timer.py:199:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=201.2379242200637, CurrSamplesPerSec=206.47832956172994, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:35,160] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=7, lr=[4.9448501014562253e-05, 4.9448501014562253e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:35,167] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=201.3431639978898, CurrSamplesPerSec=206.36562790211383, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:36,423] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:51:36,423] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:51:36,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=7, lr=[4.9425987798750784e-05, 4.9425987798750784e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:36,761] [INFO] [timer.py:199:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=201.34718391125853, CurrSamplesPerSec=201.25675590570071, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:38,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=7, lr=[4.940302954647854e-05, 4.940302954647854e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:38,356] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=201.34682580119127, CurrSamplesPerSec=201.71226585230963, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:39,446] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 526
[2023-05-15 22:51:39,446] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:51:39,446] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
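The fused_optimizer messages above trace DeepSpeed's dynamic loss scaling: a gradient overflow halves the scale and the optimizer step is skipped, while 100 consecutive overflow-free iterations double it. A minimal sketch of that update rule, seeded with the 8192.0 value visible in the log and assuming the window/factor these messages reflect (100 iterations, factor 2); the class and attribute names are illustrative, not DeepSpeed internals:

import math

class DynamicLossScaler:
    """Illustrative re-implementation of the scale updates seen in this log."""

    def __init__(self, init_scale=8192.0, scale_factor=2.0, scale_window=100):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor   # multiply/divide step size
        self.scale_window = scale_window   # clean iterations before raising the scale
        self.clean_iters = 0               # iterations since the last overflow

    def update(self, overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if overflow:
            # "Grad overflow on iteration N" -> halve the scale, skip this step
            self.cur_scale = max(self.cur_scale / self.scale_factor, 1.0)
            self.clean_iters = 0
            return False
        self.clean_iters += 1
        if self.clean_iters == self.scale_window:
            # "No Grad overflow for 100 iterations" -> double the scale
            self.cur_scale *= self.scale_factor
            self.clean_iters = 0
        return True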
[2023-05-15 22:51:39,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=8, lr=[4.938198695881265e-05, 4.938198695881265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:39,928] [INFO] [timer.py:199:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=201.40277317111895, CurrSamplesPerSec=199.4803020642444, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:41,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=8, lr=[4.935818429636215e-05, 4.935818429636215e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:41,529] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=201.38765663048596, CurrSamplesPerSec=200.85137764050256, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:43,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=8, lr=[4.9333937832816726e-05, 4.9333937832816726e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:43,132] [INFO] [timer.py:199:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=201.369706613799, CurrSamplesPerSec=200.8402573169071, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:44,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=8, lr=[4.9309248009941914e-05, 4.9309248009941914e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:44,749] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=201.32044902038487, CurrSamplesPerSec=200.95422382325552, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:46,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=8, lr=[4.928411527758123e-05, 4.928411527758123e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:46,350] [INFO] [timer.py:199:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=201.3088759328753, CurrSamplesPerSec=200.87362198242963, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:47,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=8, lr=[4.925854009364785e-05, 4.925854009364785e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:47,950] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=201.2992057306363, CurrSamplesPerSec=200.57194007546605, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:49,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=8, lr=[4.923252292411637e-05, 4.923252292411637e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:49,559] [INFO] [timer.py:199:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=201.27165870027378, CurrSamplesPerSec=197.72066143704194, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:51,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=8, lr=[4.920606424301423e-05, 4.920606424301423e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:51,171] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=201.23700232976466, CurrSamplesPerSec=198.53929878436628, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:52,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=8, lr=[4.917916453241317e-05, 4.917916453241317e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:52,778] [INFO] [timer.py:199:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=201.2150270375705, CurrSamplesPerSec=199.3851785167821, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:54,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=8, lr=[4.915182428242035e-05, 4.915182428242035e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:54,378] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=201.20673773696925, CurrSamplesPerSec=200.1630442239263, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:55,634] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:51:55,635] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
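The MemAllocated/MaxMemAllocated fields in the timer records are per-GPU tensor allocations: the current live footprint and the high-water mark. The steady 4.34GB/12.81GB pair throughout this window indicates a stable working set whose peak was set in earlier steps. A hedged sketch of how such numbers can be read back with PyTorch's allocator counters (standard torch.cuda API, not necessarily the exact calls the logger uses):

import torch

def memory_report() -> str:
    gib = 1 << 30
    # Current live tensor allocations on this rank's GPU ...
    alloc = torch.cuda.memory_allocated() / gib
    # ... and the high-water mark since process start (or the last reset).
    peak = torch.cuda.max_memory_allocated() / gib
    return f"MemAllocated={alloc:.2f}GB, MaxMemAllocated={peak:.2f}GB"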
[2023-05-15 22:51:55,958] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=8, lr=[4.9124043991169505e-05, 4.9124043991169505e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:55,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=201.22721459428553, CurrSamplesPerSec=205.9555458712288, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:57,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=8, lr=[4.90958241648118e-05, 4.90958241648118e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:57,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=201.29273522449753, CurrSamplesPerSec=206.24797813015263, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:51:58,305] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 644
[2023-05-15 22:51:58,305] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:51:58,305] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:51:59,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=9, lr=[4.9070050943360935e-05, 4.9070050943360935e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:51:59,120] [INFO] [timer.py:199:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=201.30070575438245, CurrSamplesPerSec=194.84120506868624, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:00,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=9, lr=[4.9040997423420656e-05, 4.9040997423420656e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:00,746] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=201.2424984790873, CurrSamplesPerSec=206.76141928665947, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:02,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=9, lr=[4.901150588146487e-05, 4.901150588146487e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:02,313] [INFO] [timer.py:199:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=201.29841564601836, CurrSamplesPerSec=199.75224468352678, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:03,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=9, lr=[4.8981576854823367e-05, 4.8981576854823367e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:03,917] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=201.2834020147982, CurrSamplesPerSec=200.23889432602104, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:05,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=9, lr=[4.895121088879685e-05, 4.895121088879685e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:05,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=201.25452196587702, CurrSamplesPerSec=200.39196388046153, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:07,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=9, lr=[4.8920408536646975e-05, 4.8920408536646975e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:07,129] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=201.24670686022466, CurrSamplesPerSec=201.59259342702697, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:08,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=9, lr=[4.8889170359586226e-05, 4.8889170359586226e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:08,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=201.23321893373162, CurrSamplesPerSec=198.7586304940173, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:10,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=9, lr=[4.885749692676775e-05, 4.885749692676775e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:10,347] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=201.20127639108878, CurrSamplesPerSec=199.36474324410636, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
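The lr values in the step records decay by a few hundredths of a percent every 10 steps, consistent with the early portion of a cosine schedule with no warmup. A sketch of that schedule shape; base_lr=5e-05 and total_steps=7500 are illustrative assumptions that roughly reproduce the values around step 450, not numbers taken from this log:

import math

def cosine_lr(step: int, base_lr: float = 5e-05, total_steps: int = 7500,
              warmup_steps: int = 0) -> float:
    """Cosine decay with optional linear warmup (parameter values illustrative)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# cosine_lr(450) -> ~4.956e-05, close to the lr=[4.955...e-05] logged at step 450.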
[2023-05-15 22:52:11,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=9, lr=[4.882538881527497e-05, 4.882538881527497e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:11,955] [INFO] [timer.py:199:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=201.18157518053118, CurrSamplesPerSec=200.39734949041295, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:13,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=9, lr=[4.8792846610111046e-05, 4.8792846610111046e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:13,558] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=201.16945575771678, CurrSamplesPerSec=200.52818901543503, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:14,501] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:52:14,501] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:52:15,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=9, lr=[4.875987090418826e-05, 4.875987090418826e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:15,163] [INFO] [timer.py:199:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=201.1554760554473, CurrSamplesPerSec=199.7671100553529, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:16,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=9, lr=[4.872646229831716e-05, 4.872646229831716e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:16,768] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=201.1408756840143, CurrSamplesPerSec=200.0905327340355, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:18,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=9, lr=[4.869262140119566e-05, 4.869262140119566e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:18,374] [INFO] [timer.py:199:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=201.12550734627243, CurrSamplesPerSec=199.89831851421593, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:18,659] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 771
[2023-05-15 22:52:18,660] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:52:18,660] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:52:19,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=10, lr=[4.866179549420321e-05, 4.866179549420321e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:19,950] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=201.16029177839974, CurrSamplesPerSec=198.82723425183727, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:21,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=10, lr=[4.8627134948868466e-05, 4.8627134948868466e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:21,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=201.14823439661546, CurrSamplesPerSec=199.91380116715175, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:23,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=10, lr=[4.8592043922007136e-05, 4.8592043922007136e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:23,150] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=201.1502614877882, CurrSamplesPerSec=202.18902930449835, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:24,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=10, lr=[4.855652305297052e-05, 4.855652305297052e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:24,752] [INFO] [timer.py:199:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=201.1411657346496, CurrSamplesPerSec=199.94060373490962, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:26,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=10, lr=[4.85205729889415e-05, 4.85205729889415e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:26,374] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=201.1031172672498, CurrSamplesPerSec=191.02001175571453, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:27,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=10, lr=[4.848419438492284e-05, 4.848419438492284e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:27,960] [INFO] [timer.py:199:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=201.12047775907396, CurrSamplesPerSec=195.43301624558987, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:29,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=10, lr=[4.844738790372521e-05, 4.844738790372521e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:29,563] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=201.1117174493568, CurrSamplesPerSec=206.8232295604123, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:31,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=10, lr=[4.841015421595511e-05, 4.841015421595511e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:31,124] [INFO] [timer.py:199:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=201.16593715816896, CurrSamplesPerSec=206.84235362288464, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:32,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=10, lr=[4.837249400000263e-05, 4.837249400000263e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:32,677] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=201.23098160216264, CurrSamplesPerSec=207.211822173368, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:34,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=10, lr=[4.833440794202916e-05, 4.833440794202916e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:34,230] [INFO] [timer.py:199:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=201.2931659711249, CurrSamplesPerSec=206.76938245434573, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:34,678] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:52:34,678] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:52:35,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=10, lr=[4.829589673595482e-05, 4.829589673595482e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:35,784] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=201.3534948220966, CurrSamplesPerSec=206.8146248859742, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:37,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=10, lr=[4.8256961083445826e-05, 4.8256961083445826e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:37,337] [INFO] [timer.py:199:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=201.4134120678127, CurrSamplesPerSec=207.3027146581656, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:38,078] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 894
[2023-05-15 22:52:38,078] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:52:38,078] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:52:38,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=11, lr=[4.8221556680645284e-05, 4.8221556680645284e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:38,857] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=201.51982471464007, CurrSamplesPerSec=206.3240704729608, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:40,419] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=11, lr=[4.818181654068745e-05, 4.818181654068745e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:40,425] [INFO] [timer.py:199:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=201.5554293329527, CurrSamplesPerSec=206.90485021419994, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:41,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=11, lr=[4.8141654032812614e-05, 4.8141654032812614e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:42,003] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=201.57730673962956, CurrSamplesPerSec=197.31430427962596, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:43,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=11, lr=[4.810106988877338e-05, 4.810106988877338e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:43,638] [INFO] [timer.py:199:stop] epoch=0/micro_step=930/global_step=930, RunningAvgSamplesPerSec=201.52233576965452, CurrSamplesPerSec=195.8745226006716, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:45,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=11, lr=[4.806006484800448e-05, 4.806006484800448e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:45,237] [INFO] [timer.py:199:stop] epoch=0/micro_step=940/global_step=940, RunningAvgSamplesPerSec=201.515266791332, CurrSamplesPerSec=206.59433433692286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:46,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=11, lr=[4.801863965760931e-05, 4.801863965760931e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:46,817] [INFO] [timer.py:199:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=201.53493671293683, CurrSamplesPerSec=205.52663147755968, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:47,876] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 956
[2023-05-15 22:52:47,877] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:52:47,877] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
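Each block of ten micro-steps above takes roughly 1.55-1.60 s while RunningAvgSamplesPerSec hovers around 201, which implies about 32 samples per micro-step across all ranks. A small sketch of that arithmetic, with the numbers read straight off the two timer records for steps 940 and 950:

# Derive the effective global batch size from two adjacent timer records.
t_step_940 = 22 * 3600 + 52 * 60 + 45.237   # [22:52:45,237]
t_step_950 = 22 * 3600 + 52 * 60 + 46.817   # [22:52:46,817]
samples_per_sec = 201.53                     # RunningAvgSamplesPerSec at step 950

seconds_per_micro_step = (t_step_950 - t_step_940) / 10
global_batch = samples_per_sec * seconds_per_micro_step
print(f"{seconds_per_micro_step:.3f}s per micro-step, "
      f"~{global_batch:.0f} samples per micro-step")   # 0.158s, ~32 samples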
[2023-05-15 22:52:48,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=12, lr=[4.798099838197308e-05, 4.798099838197308e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:48,353] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=960, RunningAvgSamplesPerSec=201.61093939693916, CurrSamplesPerSec=201.74046255893967, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:49,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=12, lr=[4.793877699296326e-05, 4.793877699296326e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:49,951] [INFO] [timer.py:199:stop] epoch=0/micro_step=970/global_step=970, RunningAvgSamplesPerSec=201.60634027046495, CurrSamplesPerSec=200.53777665552548, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:51,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=12, lr=[4.789613766416689e-05, 4.789613766416689e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:51,540] [INFO] [timer.py:199:stop] epoch=0/micro_step=980/global_step=980, RunningAvgSamplesPerSec=201.61085681450723, CurrSamplesPerSec=202.2499642116081, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:53,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=12, lr=[4.785308117246372e-05, 4.785308117246372e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:53,119] [INFO] [timer.py:199:stop] epoch=0/micro_step=990/global_step=990, RunningAvgSamplesPerSec=201.63024091840776, CurrSamplesPerSec=206.35420587186204, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:54,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=12, lr=[4.780960830233417e-05, 4.780960830233417e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:54,675] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=201.67702543356606, CurrSamplesPerSec=202.93986243643846, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:56,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=12, lr=[4.776571984584496e-05, 4.776571984584496e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:56,247] [INFO] [timer.py:199:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=201.703900456782, CurrSamplesPerSec=206.16339132384883, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:57,818] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=12, lr=[4.772141660263471e-05, 4.772141660263471e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:57,825] [INFO] [timer.py:199:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=201.7221710599246, CurrSamplesPerSec=201.4264995437737, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:52:59,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=12, lr=[4.76766993798994e-05, 4.76766993798994e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:52:59,415] [INFO] [timer.py:199:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=201.72544156112178, CurrSamplesPerSec=201.3808641428865, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:00,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=12, lr=[4.7631568992377586e-05, 4.7631568992377586e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:01,005] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=201.728271468453, CurrSamplesPerSec=206.72607022829277, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:02,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=12, lr=[4.758602626233562e-05, 4.758602626233562e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:02,559] [INFO] [timer.py:199:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=201.77388967251855, CurrSamplesPerSec=206.28950490910321, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:03,783] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:53:03,783] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 22:53:04,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=12, lr=[4.7540072019552664e-05, 4.7540072019552664e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:04,113] [INFO] [timer.py:199:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=201.82023083630207, CurrSamplesPerSec=207.52836980319879, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:05,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=12, lr=[4.749370710130554e-05, 4.749370710130554e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:05,667] [INFO] [timer.py:199:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=201.86504176578, CurrSamplesPerSec=206.47451791799028, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:07,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=12, lr=[4.74469323523535e-05, 4.74469323523535e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:07,221] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=201.9085213900197, CurrSamplesPerSec=206.74422093755825, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:08,773] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=12, lr=[4.739974862492281e-05, 4.739974862492281e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:08,779] [INFO] [timer.py:199:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=201.9477138410553, CurrSamplesPerSec=204.41042995146273, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:10,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=12, lr=[4.735215677869128e-05, 4.735215677869128e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:10,347] [INFO] [timer.py:199:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=201.9744152797632, CurrSamplesPerSec=206.2568526779883, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:11,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=12, lr=[4.730415768077252e-05, 4.730415768077252e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:11,903] [INFO] [timer.py:199:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=202.0136067290997, CurrSamplesPerSec=206.98430088026106, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:13,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=12, lr=[4.7255752205700194e-05, 4.7255752205700194e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:13,470] [INFO] [timer.py:199:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=202.04061691316366, CurrSamplesPerSec=206.24797813015263, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:15,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=12, lr=[4.7206941235412075e-05, 4.7206941235412075e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:15,027] [INFO] [timer.py:199:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=202.07719753891053, CurrSamplesPerSec=205.54929391963807, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:16,600] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=12, lr=[4.7157725659233985e-05, 4.7157725659233985e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:16,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=202.08891747939848, CurrSamplesPerSec=203.45513967124052, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:18,184] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=12, lr=[4.710810637386357e-05, 4.710810637386357e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:18,191] [INFO] [timer.py:199:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=202.09447645739283, CurrSamplesPerSec=203.27303292679048, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:19,430] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:53:19,430] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:53:19,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=12, lr=[4.705808428335397e-05, 4.705808428335397e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:19,760] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=202.11773080417922, CurrSamplesPerSec=206.666545536854, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:21,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=12, lr=[4.700766029909737e-05, 4.700766029909737e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:21,314] [INFO] [timer.py:199:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=202.15614336897127, CurrSamplesPerSec=206.57907147122742, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:22,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=12, lr=[4.695683533980835e-05, 4.695683533980835e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:22,867] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=202.19422802950973, CurrSamplesPerSec=206.4224395540511, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:24,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=12, lr=[4.69056103315072e-05, 4.69056103315072e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:24,420] [INFO] [timer.py:199:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=202.23253836218987, CurrSamplesPerSec=206.19379656798733, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:25,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=12, lr=[4.685398620750301e-05, 4.685398620750301e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:25,974] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=202.2697413607035, CurrSamplesPerSec=206.88539569698221, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:27,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=12, lr=[4.680196390837667e-05, 4.680196390837667e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:27,529] [INFO] [timer.py:199:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=202.3055678181584, CurrSamplesPerSec=207.2966313341066, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:29,077] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=12, lr=[4.674954438196374e-05, 4.674954438196374e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:29,084] [INFO] [timer.py:199:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=202.34044847040252, CurrSamplesPerSec=205.90309718968658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:30,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=12, lr=[4.669672858333718e-05, 4.669672858333718e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:30,666] [INFO] [timer.py:199:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=202.34587979202118, CurrSamplesPerSec=202.23198459503195, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:32,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=12, lr=[4.6643517474789954e-05, 4.6643517474789954e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:32,252] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=202.34765206360262, CurrSamplesPerSec=202.78716656720027, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:33,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=12, lr=[4.658991202581748e-05, 4.658991202581748e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:33,835] [INFO] [timer.py:199:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=202.35253467316036, CurrSamplesPerSec=203.62954028586296, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:35,083] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:53:35,083] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:53:35,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=12, lr=[4.65359132131e-05, 4.65359132131e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:35,419] [INFO] [timer.py:199:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=202.35579747293934, CurrSamplesPerSec=202.67937549171947, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:36,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=12, lr=[4.648152202048474e-05, 4.648152202048474e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:37,002] [INFO] [timer.py:199:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=202.3598404676842, CurrSamplesPerSec=202.77552451038898, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:38,393] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1278
[2023-05-15 22:53:38,393] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:53:38,393] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
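Each "Overflow detected. Skipping step." line pairs with the skipped counter ticking up in the next step record (skipped=12 -> 13 across iteration 1278 above): the batch is still consumed and global_step advances, but the parameter update is dropped. A sketch of that loop pattern, built on the DynamicLossScaler sketch earlier; the engine methods here are illustrative placeholders, not DeepSpeed's actual API:

def train_step(engine, scaler, batch, skipped_steps=0):
    loss = engine(batch)                      # forward pass
    (loss * scaler.cur_scale).backward()      # backward on the scaled loss
    overflow = engine.check_overflow()        # any inf/nan in the local grads?
    if scaler.update(overflow):
        engine.unscale_and_step()             # grads /= cur_scale, then the Adam step
    else:
        engine.zero_grad()                    # discard this batch's gradients
        skipped_steps += 1                    # surfaces as "skipped=N" in the log
    return skipped_steps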
Attempted loss scale: 32768.0, reducing to 16384.0 [2023-05-15 22:53:38,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=13, lr=[4.643223528122942e-05, 4.643223528122942e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:38,549] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=202.40030829368754, CurrSamplesPerSec=206.57684582506698, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:40,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=13, lr=[4.637710130289737e-05, 4.637710130289737e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:40,144] [INFO] [timer.py:199:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=202.39364903079263, CurrSamplesPerSec=203.05346015186151, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:41,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=13, lr=[4.63215778381878e-05, 4.63215778381878e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:41,730] [INFO] [timer.py:199:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=202.39439434063974, CurrSamplesPerSec=202.09952162043737, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:43,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=13, lr=[4.6265665898726776e-05, 4.6265665898726776e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:43,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=202.39995307131275, CurrSamplesPerSec=203.21455684032503, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:44,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=13, lr=[4.620936650321831e-05, 4.620936650321831e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:44,902] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=202.39688361254403, CurrSamplesPerSec=202.1473100789204, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:46,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=13, lr=[4.6152680677425775e-05, 4.6152680677425775e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:46,494] [INFO] [timer.py:199:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=202.39226898465012, CurrSamplesPerSec=204.1918114989533, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:48,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=13, lr=[4.6095609454153226e-05, 4.6095609454153226e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:48,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=202.38763048063922, CurrSamplesPerSec=201.20999081032542, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] 
[fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,163] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,164] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1346 [2023-05-15 22:53:49,164] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,164] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 [2023-05-15 22:53:49,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=14, lr=[4.6043916697633864e-05, 4.6043916697633864e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:49,632] [INFO] [timer.py:199:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=202.42792877762662, CurrSamplesPerSec=206.01339677666923, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
Attempted loss scale: 8192.0, reducing to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1351 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:49,911] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-05-15 22:53:51,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=15, lr=[4.599191337420851e-05, 4.599191337420851e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:51,161] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=202.48349426745773, CurrSamplesPerSec=206.4754708157319, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:52,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=15, lr=[4.593376859220696e-05, 4.593376859220696e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:52,713] [INFO] [timer.py:199:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=202.51527584071965, CurrSamplesPerSec=207.09544265047154, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:54,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=15, lr=[4.587524240125845e-05, 4.587524240125845e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:54,266] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=202.54700539465517, CurrSamplesPerSec=206.11115325611073, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:55,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=15, lr=[4.581633586769812e-05, 4.581633586769812e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:55,833] [INFO] [timer.py:199:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=202.56386045115002, CurrSamplesPerSec=206.49294216653564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:57,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=15, lr=[4.5757050064790844e-05, 4.5757050064790844e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:57,385] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=202.5950408550534, CurrSamplesPerSec=206.83470357351447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:53:58,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=15, lr=[4.569738607271174e-05, 4.569738607271174e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:53:58,942] [INFO] [timer.py:199:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=202.6214194288114, CurrSamplesPerSec=203.40087411421678, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:54:00,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=15, lr=[4.563734497852644e-05, 4.563734497852644e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:54:00,510] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=202.63735767864526, CurrSamplesPerSec=206.85127939574116, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:54:02,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=15, lr=[4.5576927876171295e-05, 4.5576927876171295e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:54:02,063] [INFO] [timer.py:199:stop] epoch=0/micro_step=1430/global_step=1430, 
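Three skipped steps in quick succession (iterations 1278, 1346, 1351) cut the loss scale from 32768 down to 4096. This is fp16 dynamic loss scaling doing its job: an overflow in the scaled gradients halves the scale and discards that optimizer step, while a long overflow-free stretch doubles it again (the "No Grad overflow for 100 iterations" messages further down). A minimal Python sketch of that policy, with illustrative names only (this is not the fused_optimizer code):

    class DynamicLossScaler:
        """Halve on overflow, double after a clean window (evidently 100 iterations here)."""
        def __init__(self, init_scale=2**16, scale_window=100, min_scale=1.0):
            self.scale = init_scale
            self.scale_window = scale_window
            self.min_scale = min_scale
            self.clean_iters = 0

        def update(self, has_overflow):
            if has_overflow:
                # mirrors "Reducing dynamic loss scale from X to Y"; caller skips the step
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.clean_iters = 0
                return False
            self.clean_iters += 1
            if self.clean_iters % self.scale_window == 0:
                # mirrors "Increasing dynamic loss scale from X to Y"
                self.scale *= 2.0
            return True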
[2023-05-15 22:53:51,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=15, lr=[4.599191337420851e-05, 4.599191337420851e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:51,161] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=202.48349426745773, CurrSamplesPerSec=206.4754708157319, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:52,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=15, lr=[4.593376859220696e-05, 4.593376859220696e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:52,713] [INFO] [timer.py:199:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=202.51527584071965, CurrSamplesPerSec=207.09544265047154, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:54,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=15, lr=[4.587524240125845e-05, 4.587524240125845e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:54,266] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=202.54700539465517, CurrSamplesPerSec=206.11115325611073, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:55,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=15, lr=[4.581633586769812e-05, 4.581633586769812e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:55,833] [INFO] [timer.py:199:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=202.56386045115002, CurrSamplesPerSec=206.49294216653564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:57,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=15, lr=[4.5757050064790844e-05, 4.5757050064790844e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:57,385] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=202.5950408550534, CurrSamplesPerSec=206.83470357351447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:53:58,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=15, lr=[4.569738607271174e-05, 4.569738607271174e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:53:58,942] [INFO] [timer.py:199:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=202.6214194288114, CurrSamplesPerSec=203.40087411421678, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:00,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=15, lr=[4.563734497852644e-05, 4.563734497852644e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:00,510] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=202.63735767864526, CurrSamplesPerSec=206.85127939574116, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:02,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=15, lr=[4.5576927876171295e-05, 4.5576927876171295e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:02,063] [INFO] [timer.py:199:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=202.6661828976023, CurrSamplesPerSec=207.00632812801234, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:03,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=15, lr=[4.551613586643345e-05, 4.551613586643345e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:03,613] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=202.69703326692502, CurrSamplesPerSec=206.93132699156658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:05,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=15, lr=[4.545497005693079e-05, 4.545497005693079e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:05,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=202.71578056468184, CurrSamplesPerSec=207.870350638087, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:05,624] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:54:05,624] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 22:54:06,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=15, lr=[4.539343156209175e-05, 4.539343156209175e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:06,729] [INFO] [timer.py:199:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=202.74418329268894, CurrSamplesPerSec=206.9523856515518, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:08,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=15, lr=[4.5331521503135005e-05, 4.5331521503135005e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:08,282] [INFO] [timer.py:199:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=202.77161942816923, CurrSamplesPerSec=206.6869137650394, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:09,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=15, lr=[4.526924100804908e-05, 4.526924100804908e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:09,850] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=202.786072657764, CurrSamplesPerSec=197.9977547482943, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:11,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=15, lr=[4.5206591211571744e-05, 4.5206591211571744e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:11,403] [INFO] [timer.py:199:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=202.81277869064775, CurrSamplesPerSec=206.91792825737068, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:12,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=15, lr=[4.514357325516937e-05, 4.514357325516937e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:12,957] [INFO] [timer.py:199:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=202.83905811940613, CurrSamplesPerSec=206.9278176315599, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:14,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=15, lr=[4.508018828701612e-05, 4.508018828701612e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:14,513] [INFO] [timer.py:199:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=202.8623455128611, CurrSamplesPerSec=206.79837356535245, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:16,061] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=15, lr=[4.501643746197306e-05, 4.501643746197306e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:16,068] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=202.88685486497133, CurrSamplesPerSec=205.4844674567537, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:17,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=15, lr=[4.495232194156708e-05, 4.495232194156708e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:17,621] [INFO] [timer.py:199:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=202.91212409842385, CurrSamplesPerSec=206.8566989908206, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
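After 100 overflow-free iterations the scaler grows again (4096 to 8192 at 22:54:05). In DeepSpeed these thresholds come from the fp16 block of the engine config. Below is a hypothetical fragment consistent with what this log shows; the keys are DeepSpeed's standard fp16 options, but the values are guesses matched to the observed behavior (dynamic scaling, a 100-iteration growth window):

    ds_config = {
        "fp16": {
            "enabled": True,
            "loss_scale": 0,            # 0 selects dynamic loss scaling
            "initial_scale_power": 16,  # start at 2**16
            "loss_scale_window": 100,   # double after 100 clean iterations
            "hysteresis": 2,
            "min_loss_scale": 1,
        },
    }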
[2023-05-15 22:54:18,363] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1534
[2023-05-15 22:54:18,363] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:54:18,363] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 22:54:18,485] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1535
[2023-05-15 22:54:18,485] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0
[2023-05-15 22:54:18,485] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-05-15 22:54:19,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=17, lr=[4.490076772936954e-05, 4.490076772936954e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:19,109] [INFO] [timer.py:199:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=202.99132258265277, CurrSamplesPerSec=205.13335422095537, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:19,550] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1542
[2023-05-15 22:54:19,550] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
[2023-05-15 22:54:19,550] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
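Three overflows inside two seconds (iterations 1534, 1535, 1542) drop the scale from 8192 all the way to 1024. Note that skipped steps still advance global_step (the counter runs straight through each overflow above), so the number of optimizer updates actually applied is global_step minus skipped; for the step=1540 entry above:

    applied_updates = 1540 - 17  # 1523 weight updates have actually landed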
[2023-05-15 22:54:20,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=18, lr=[4.4842491890810014e-05, 4.4842491890810014e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:20,639] [INFO] [timer.py:199:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=203.03515783913204, CurrSamplesPerSec=207.3920846023218, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:22,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=18, lr=[4.477739754667796e-05, 4.477739754667796e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:22,194] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=203.05785854482102, CurrSamplesPerSec=206.7639674335847, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:23,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=18, lr=[4.4711942862440933e-05, 4.4711942862440933e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:23,745] [INFO] [timer.py:199:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=203.08302979657327, CurrSamplesPerSec=207.35748064584405, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:25,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=18, lr=[4.4646129030669795e-05, 4.4646129030669795e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:25,299] [INFO] [timer.py:199:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=203.1053031025199, CurrSamplesPerSec=205.86014708927354, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:26,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=18, lr=[4.4579957250478985e-05, 4.4579957250478985e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:26,852] [INFO] [timer.py:199:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=203.1286383775893, CurrSamplesPerSec=206.76747123816097, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:28,398] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=18, lr=[4.451342872750468e-05, 4.451342872750468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:28,405] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=203.15187524018683, CurrSamplesPerSec=206.93260315199058, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:29,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=18, lr=[4.444654467388286e-05, 4.444654467388286e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:29,958] [INFO] [timer.py:199:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=203.17383811208418, CurrSamplesPerSec=206.91952325378827, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:31,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=18, lr=[4.43793063082272e-05, 4.43793063082272e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:31,522] [INFO] [timer.py:199:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=203.1876328635976, CurrSamplesPerSec=206.10893768101252, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:33,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=18, lr=[4.4311714855606835e-05, 4.4311714855606835e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:33,074] [INFO] [timer.py:199:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=203.21080842552905, CurrSamplesPerSec=206.7849920424485, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:34,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=18, lr=[4.42437715475241e-05, 4.42437715475241e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:34,626] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=203.2336502179551, CurrSamplesPerSec=206.3304140359939, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:35,234] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:54:35,234] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
[2023-05-15 22:54:36,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=18, lr=[4.417547762189207e-05, 4.417547762189207e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:36,186] [INFO] [timer.py:199:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=203.2497919741135, CurrSamplesPerSec=204.27167893098752, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:37,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=18, lr=[4.410683432301198e-05, 4.410683432301198e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:37,744] [INFO] [timer.py:199:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=203.26742628818712, CurrSamplesPerSec=206.75250126700053, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
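The throughput numbers are easy to sanity-check against the timestamps. Assuming this run uses 8 ranks with a per-device micro-batch of 4 (32 samples per step, which matches the eight-way duplicated rank messages), ten steps in the roughly 1.55 s between consecutive log entries gives:

    samples_per_step = 8 * 4                # assumed world size x per-device batch
    print(10 * samples_per_step / 1.55)     # ~206 samples/s, matching CurrSamplesPerSec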
[2023-05-15 22:54:39,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=18, lr=[4.403784290155057e-05, 4.403784290155057e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:39,302] [INFO] [timer.py:199:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=203.28468825080265, CurrSamplesPerSec=205.6327473625185, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:40,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=18, lr=[4.3968504614517336e-05, 4.3968504614517336e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:40,856] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=203.3050734966065, CurrSamplesPerSec=207.04113434917147, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:42,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=18, lr=[4.389882072524154e-05, 4.389882072524154e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:42,408] [INFO] [timer.py:199:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=203.32635786230017, CurrSamplesPerSec=207.17024304558686, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:43,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=18, lr=[4.38287925033493e-05, 4.38287925033493e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:43,961] [INFO] [timer.py:199:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=203.34648998852808, CurrSamplesPerSec=206.54219552130397, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:45,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=18, lr=[4.375842122474037e-05, 4.375842122474037e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:45,521] [INFO] [timer.py:199:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=203.36175710881534, CurrSamplesPerSec=206.29996849038187, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:47,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=18, lr=[4.3687708171564925e-05, 4.3687708171564925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:47,080] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=203.3776349688407, CurrSamplesPerSec=205.86235731880524, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:48,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=18, lr=[4.3616654632200224e-05, 4.3616654632200224e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:48,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=203.3939743385674, CurrSamplesPerSec=206.6127799756777, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:50,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=18, lr=[4.354526190122709e-05, 4.354526190122709e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:50,197] [INFO] [timer.py:199:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=203.4091828744981, CurrSamplesPerSec=206.09469583550225, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:50,802] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:54:50,802] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2023-05-15 22:54:51,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=18, lr=[4.3473531279406375e-05, 4.3473531279406375e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:51,751] [INFO] [timer.py:199:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=203.42764502949302, CurrSamplesPerSec=206.53647457105487, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:53,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=18, lr=[4.340146407365521e-05, 4.340146407365521e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:53,330] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=203.4277836918981, CurrSamplesPerSec=201.5553573305702, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:54,924] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=18, lr=[4.3329061597023216e-05, 4.3329061597023216e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:54,931] [INFO] [timer.py:199:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=203.4122330809237, CurrSamplesPerSec=194.85167097354898, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:56,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=18, lr=[4.3256325168668596e-05, 4.3256325168668596e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:56,538] [INFO] [timer.py:199:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=203.3917963808652, CurrSamplesPerSec=201.5347774250801, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:58,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=18, lr=[4.3183256113834076e-05, 4.3183256113834076e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:58,134] [INFO] [timer.py:199:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=203.3795403992166, CurrSamplesPerSec=201.82297560850913, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:54:59,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=18, lr=[4.310985576382276e-05, 4.310985576382276e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:54:59,749] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=203.35433637212572, CurrSamplesPerSec=200.46619110776547, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:01,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=18, lr=[4.3036125455973896e-05, 4.3036125455973896e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:01,349] [INFO] [timer.py:199:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=203.33981854933984, CurrSamplesPerSec=200.8240298652622, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:02,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=18, lr=[4.296206653363848e-05, 4.296206653363848e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:02,943] [INFO] [timer.py:199:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=203.32960299147314, CurrSamplesPerSec=201.53871149384727, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:04,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=18, lr=[4.288768034615482e-05, 4.288768034615482e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:04,538] [INFO] [timer.py:199:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=203.31925286899803, CurrSamplesPerSec=201.47003866753127, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:06,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=18, lr=[4.2812968248823894e-05, 4.2812968248823894e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:06,133] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=203.30849782479595, CurrSamplesPerSec=201.190084527648, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:06,754] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:55:06,754] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 22:55:07,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=18, lr=[4.273793160288473e-05, 4.273793160288473e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:07,730] [INFO] [timer.py:199:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=203.2965334287637, CurrSamplesPerSec=201.36424711421034, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:09,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=18, lr=[4.2662571775489523e-05, 4.2662571775489523e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:09,328] [INFO] [timer.py:199:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=203.28414738682187, CurrSamplesPerSec=201.00117110398264, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:10,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=18, lr=[4.25868901396788e-05, 4.25868901396788e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:10,924] [INFO] [timer.py:199:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=203.27338676569337, CurrSamplesPerSec=200.8712169511717, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:12,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=18, lr=[4.251088807435636e-05, 4.251088807435636e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:12,524] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=203.26002407778458, CurrSamplesPerSec=201.4349640106377, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:13,290] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1884
[2023-05-15 22:55:13,290] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:55:13,290] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 22:55:14,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=19, lr=[4.2442213392702635e-05, 4.2442213392702635e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:14,085] [INFO] [timer.py:199:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=203.27327099479172, CurrSamplesPerSec=201.71408475681784, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:15,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=19, lr=[4.2365606331076925e-05, 4.2365606331076925e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:15,677] [INFO] [timer.py:199:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=203.2647675015335, CurrSamplesPerSec=201.81781656552047, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:17,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=19, lr=[4.2288682871684857e-05, 4.2288682871684857e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:17,272] [INFO] [timer.py:199:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=203.2549709805113, CurrSamplesPerSec=201.54506685246807, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:18,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=19, lr=[4.2211444416056e-05, 4.2211444416056e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:18,868] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=203.24458898618897, CurrSamplesPerSec=200.99635498875347, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:20,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=19, lr=[4.2133892371459074e-05, 4.2133892371459074e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:20,465] [INFO] [timer.py:199:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=203.23346689720526, CurrSamplesPerSec=200.53238349516738, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:22,055] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=19, lr=[4.2056028150876356e-05, 4.2056028150876356e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:22,061] [INFO] [timer.py:199:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=203.22306047735805, CurrSamplesPerSec=201.28905159187812, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:23,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=19, lr=[4.1977853172977885e-05, 4.1977853172977885e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:23,658] [INFO] [timer.py:199:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=203.21221404265552, CurrSamplesPerSec=201.18978294759563, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:25,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=19, lr=[4.189936886209563e-05, 4.189936886209563e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:25,256] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=203.20089756856532, CurrSamplesPerSec=201.5714004658658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:26,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=19, lr=[4.182057664819757e-05, 4.182057664819757e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:26,850] [INFO] [timer.py:199:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=203.19218650239495, CurrSamplesPerSec=202.04384765919013, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:28,438] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=19, lr=[4.174147796686158e-05, 4.174147796686158e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:28,445] [INFO] [timer.py:199:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=203.1832666174955, CurrSamplesPerSec=201.325283385658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:29,385] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:55:29,385] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 22:55:30,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=19, lr=[4.1662074259249305e-05, 4.1662074259249305e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:30,042] [INFO] [timer.py:199:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=203.17237469937496, CurrSamplesPerSec=200.93587219651388, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:31,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=19, lr=[4.158236697207996e-05, 4.158236697207996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:31,638] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=203.16303483535873, CurrSamplesPerSec=201.36424711421034, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:33,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=19, lr=[4.1502357557603856e-05, 4.1502357557603856e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:33,231] [INFO] [timer.py:199:stop] epoch=0/micro_step=2010/global_step=2010, RunningAvgSamplesPerSec=203.15507059710598, CurrSamplesPerSec=201.85423898069553, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:34,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=19, lr=[4.142204747357604e-05, 4.142204747357604e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:34,827] [INFO] [timer.py:199:stop] epoch=0/micro_step=2020/global_step=2020, RunningAvgSamplesPerSec=203.14550896994209, CurrSamplesPerSec=206.66145410704635, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:36,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=19, lr=[4.134143818322967e-05, 4.134143818322967e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:36,400] [INFO] [timer.py:199:stop] epoch=0/micro_step=2030/global_step=2030, RunningAvgSamplesPerSec=203.15115033111852, CurrSamplesPerSec=200.0171795341795, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:37,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=19, lr=[4.1260531155249397e-05, 4.1260531155249397e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:37,975] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=2040, RunningAvgSamplesPerSec=203.15486099512466, CurrSamplesPerSec=206.66272694097955, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:39,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=19, lr=[4.117932786374459e-05, 4.117932786374459e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:39,533] [INFO] [timer.py:199:stop] epoch=0/micro_step=2050/global_step=2050, RunningAvgSamplesPerSec=203.16994712684343, CurrSamplesPerSec=204.5300438111928, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:41,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=19, lr=[4.109782978822248e-05, 4.109782978822248e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:41,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=2060/global_step=2060, RunningAvgSamplesPerSec=203.18678299259204, CurrSamplesPerSec=206.95589584493, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:42,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=19, lr=[4.1016038413561186e-05, 4.1016038413561186e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:42,640] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=203.20386640363623, CurrSamplesPerSec=207.37253835576226, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:44,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=19, lr=[4.093395522998269e-05, 4.093395522998269e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:44,194] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=2080, RunningAvgSamplesPerSec=203.22062682328226, CurrSamplesPerSec=206.77702767241723, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:45,111] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:55:45,111] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:55:45,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=19, lr=[4.085158173302568e-05, 4.085158173302568e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:45,754] [INFO] [timer.py:199:stop] epoch=0/micro_step=2090/global_step=2090, RunningAvgSamplesPerSec=203.23376259440238, CurrSamplesPerSec=206.29996849038187, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:47,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=19, lr=[4.076891942351827e-05, 4.076891942351827e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:47,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=2100/global_step=2100, RunningAvgSamplesPerSec=203.24814134321608, CurrSamplesPerSec=205.95965442632007, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:48,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=19, lr=[4.068596980755071e-05, 4.068596980755071e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:48,872] [INFO] [timer.py:199:stop] epoch=0/micro_step=2110/global_step=2110, RunningAvgSamplesPerSec=203.2602782354475, CurrSamplesPerSec=206.37546455254576, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:49,778] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2115
[2023-05-15 22:55:49,778] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:55:49,778] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:55:50,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=20, lr=[4.0611070755158766e-05, 4.0611070755158766e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:55:50,403] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=2120, RunningAvgSamplesPerSec=203.29067608460383, CurrSamplesPerSec=206.3881583904465, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:55:51,611] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2127
[2023-05-15 22:55:51,611] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:55:51,612] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
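The overflows at iterations 2115 and 2127 come only a few steps after the scale was raised to 16384, and they knock it straight back down to 4096: the scaled gradients here are evidently sitting near the top of the fp16 range, so the scaler oscillates between roughly 4096 and 16384 rather than settling. fp16 leaves little headroom:

    import torch
    print(torch.finfo(torch.float16).max)  # 65504.0; scaled grads beyond this become inf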
CurrSamplesPerSec=206.13172872838743, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:56:02,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=21, lr=[3.9943860703258475e-05, 3.9943860703258475e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:56:02,857] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=2200, RunningAvgSamplesPerSec=203.3999342383992, CurrSamplesPerSec=207.65583811791498, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:56:04,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=21, lr=[3.985817626320531e-05, 3.985817626320531e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:56:04,410] [INFO] [timer.py:199:stop] epoch=0/micro_step=2210/global_step=2210, RunningAvgSamplesPerSec=203.4153805231964, CurrSamplesPerSec=207.172161714949, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:56:05,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=21, lr=[3.977222111024456e-05, 3.977222111024456e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:56:05,987] [INFO] [timer.py:199:stop] epoch=0/micro_step=2220/global_step=2220, RunningAvgSamplesPerSec=203.41649601482996, CurrSamplesPerSec=203.68701901991224, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,372] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,373] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,373] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,373] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,379] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 22:56:07,379] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-05-15 22:56:07,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=21, lr=[3.968599681046139e-05, 3.968599681046139e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 22:56:07,552] [INFO] [timer.py:199:stop] epoch=0/micro_step=2230/global_step=2230, RunningAvgSamplesPerSec=203.4245079461082, CurrSamplesPerSec=198.51580669303334, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 22:56:09,111] [INFO] 
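Note: the overflow/skip pattern above is fp16 dynamic loss scaling at work: on any gradient overflow the optimizer step is skipped and the scale is halved; after 100 consecutive overflow-free iterations the scale is doubled. A minimal sketch of that policy, inferred from the messages themselves (class and method names are hypothetical; the scale_window=100 default is taken from the "No Grad overflow for 100 iterations" lines, and this is not DeepSpeed's actual code):

    class DynamicLossScaler:
        """Sketch of the scale-update policy visible in the log:
        halve on overflow and skip the step, double after a window
        of clean iterations. Illustrative only."""

        def __init__(self, init_scale=2 ** 16, scale_factor=2.0, scale_window=100):
            self.cur_scale = float(init_scale)
            self.scale_factor = scale_factor   # halve/double, as in the messages
            self.scale_window = scale_window   # "No Grad overflow for 100 iterations"
            self.clean_iters = 0

        def update_scale(self, overflow):
            """Return True if the optimizer step may be applied."""
            if overflow:
                # "Grad overflow on iteration N" ->
                # "Reducing dynamic loss scale from X to X/2"
                self.cur_scale = max(self.cur_scale / self.scale_factor, 1.0)
                self.clean_iters = 0
                return False   # "Overflow detected. Skipping step."
            self.clean_iters += 1
            if self.clean_iters % self.scale_window == 0:
                # "Increasing dynamic loss scale from X to 2X"
                self.cur_scale *= self.scale_factor
            return True

The loss is multiplied by cur_scale before backward and gradients are divided by it before the update, so the occasional skipped step costs one micro-batch of progress but no correctness.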
[2023-05-15 22:56:09,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=21, lr=[3.959950493484475e-05, 3.959950493484475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:09,118] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=2240, RunningAvgSamplesPerSec=203.43237464036673, CurrSamplesPerSec=206.00896987631884, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:10,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=21, lr=[3.951274705925879e-05, 3.951274705925879e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:10,683] [INFO] [timer.py:199:stop] epoch=0/micro_step=2250/global_step=2250, RunningAvgSamplesPerSec=203.44017167225545, CurrSamplesPerSec=207.30399540346346, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:12,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=21, lr=[3.9425724764414116e-05, 3.9425724764414116e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:12,237] [INFO] [timer.py:199:stop] epoch=0/micro_step=2260/global_step=2260, RunningAvgSamplesPerSec=203.45474111652928, CurrSamplesPerSec=206.54124200758653, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:13,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=21, lr=[3.9338439635838986e-05, 3.9338439635838986e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:13,792] [INFO] [timer.py:199:stop] epoch=0/micro_step=2270/global_step=2270, RunningAvgSamplesPerSec=203.4681490381615, CurrSamplesPerSec=206.8873090938238, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:15,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=21, lr=[3.925089326385045e-05, 3.925089326385045e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:15,346] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=2280, RunningAvgSamplesPerSec=203.48206460289404, CurrSamplesPerSec=206.63186513740283, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:16,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=21, lr=[3.916308724352534e-05, 3.916308724352534e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:16,926] [INFO] [timer.py:199:stop] epoch=0/micro_step=2290/global_step=2290, RunningAvgSamplesPerSec=203.4821045246409, CurrSamplesPerSec=194.44714908881235, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:18,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=21, lr=[3.9075023174671256e-05, 3.9075023174671256e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:18,573] [INFO] [timer.py:199:stop] epoch=0/micro_step=2300/global_step=2300, RunningAvgSamplesPerSec=203.44466986236378, CurrSamplesPerSec=195.00964458302215, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:20,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=21, lr=[3.898670266179736e-05, 3.898670266179736e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:20,170] [INFO] [timer.py:199:stop] epoch=0/micro_step=2310/global_step=2310, RunningAvgSamplesPerSec=203.43524627774423, CurrSamplesPerSec=207.45619651018828, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:21,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=21, lr=[3.88981273140852e-05, 3.88981273140852e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:21,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=2320, RunningAvgSamplesPerSec=203.44155620016716, CurrSamplesPerSec=206.71046953425443, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:23,116] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:56:23,116] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:56:23,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=21, lr=[3.880929874535932e-05, 3.880929874535932e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:23,291] [INFO] [timer.py:199:stop] epoch=0/micro_step=2330/global_step=2330, RunningAvgSamplesPerSec=203.4558946673967, CurrSamplesPerSec=205.80680023430048, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:24,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=21, lr=[3.872021857405796e-05, 3.872021857405796e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:24,853] [INFO] [timer.py:199:stop] epoch=0/micro_step=2340/global_step=2340, RunningAvgSamplesPerSec=203.465286156847, CurrSamplesPerSec=205.383532925885, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:26,389] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2349
[2023-05-15 22:56:26,389] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:56:26,389] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:56:26,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=22, lr=[3.863983264101839e-05, 3.863983264101839e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:26,390] [INFO] [timer.py:199:stop] epoch=0/micro_step=2350/global_step=2350, RunningAvgSamplesPerSec=203.48842521010314, CurrSamplesPerSec=248.08228885754897, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:27,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=22, lr=[3.855027890001512e-05, 3.855027890001512e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:27,942] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=2360, RunningAvgSamplesPerSec=203.50280072249979, CurrSamplesPerSec=206.3332687666221, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:29,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=22, lr=[3.846047827572451e-05, 3.846047827572451e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:29,511] [INFO] [timer.py:199:stop] epoch=0/micro_step=2370/global_step=2370, RunningAvgSamplesPerSec=203.50833073262694, CurrSamplesPerSec=207.8465297930791, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:31,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=22, lr=[3.837043240429543e-05, 3.837043240429543e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:31,065] [INFO] [timer.py:199:stop] epoch=0/micro_step=2380/global_step=2380, RunningAvgSamplesPerSec=203.52182189103718, CurrSamplesPerSec=205.19513589706742, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:32,611] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=22, lr=[3.828014292634509e-05, 3.828014292634509e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:32,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=2390/global_step=2390, RunningAvgSamplesPerSec=203.53600770954768, CurrSamplesPerSec=207.2723016077697, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:34,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=22, lr=[3.818961148692914e-05, 3.818961148692914e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:34,169] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=2400, RunningAvgSamplesPerSec=203.55009210013316, CurrSamplesPerSec=206.07444415613529, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
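Note: the throughput columns are easy to sanity-check from the run configuration: 8 GPUs times a per-device micro-batch of 4 is 32 samples per micro-step, so ~206 samples/s corresponds to roughly 155 ms per step. The outlier CurrSamplesPerSec=248 at step 2350 is the overflow-skipped step at iteration 2349, which does no optimizer work. A back-of-the-envelope check, with all values taken from the log:

    world_size = 8                # GPUs, from the launcher's world info
    micro_batch_per_gpu = 4       # --per_device_train_batch_size
    samples_per_step = world_size * micro_batch_per_gpu   # 32 samples/micro-step

    curr = 206.39                 # CurrSamplesPerSec at step 2120
    print(f"{1000 * samples_per_step / curr:.0f} ms per micro-step")   # ~155 ms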
[2023-05-15 22:56:35,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=22, lr=[3.809883973551177e-05, 3.809883973551177e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:35,722] [INFO] [timer.py:199:stop] epoch=0/micro_step=2410/global_step=2410, RunningAvgSamplesPerSec=203.56392104152087, CurrSamplesPerSec=206.44021945603572, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:37,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=22, lr=[3.800782932593554e-05, 3.800782932593554e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:37,281] [INFO] [timer.py:199:stop] epoch=0/micro_step=2420/global_step=2420, RunningAvgSamplesPerSec=203.57415255759886, CurrSamplesPerSec=207.02484741260835, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:38,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=22, lr=[3.791658191639136e-05, 3.791658191639136e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:38,856] [INFO] [timer.py:199:stop] epoch=0/micro_step=2430/global_step=2430, RunningAvgSamplesPerSec=203.5756543681075, CurrSamplesPerSec=190.8090086350286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:40,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=22, lr=[3.782509916938822e-05, 3.782509916938822e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:40,438] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=2440, RunningAvgSamplesPerSec=203.57386286853628, CurrSamplesPerSec=200.82703475105674, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:42,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=22, lr=[3.77333827517229e-05, 3.77333827517229e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:42,036] [INFO] [timer.py:199:stop] epoch=0/micro_step=2450/global_step=2450, RunningAvgSamplesPerSec=203.56308548642866, CurrSamplesPerSec=201.6056141700113, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:42,177] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:56:42,177] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:56:42,958] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2455
[2023-05-15 22:56:42,958] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:56:42,959] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:56:43,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=23, lr=[3.765063956844678e-05, 3.765063956844678e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:43,593] [INFO] [timer.py:199:stop] epoch=0/micro_step=2460/global_step=2460, RunningAvgSamplesPerSec=203.574270930064, CurrSamplesPerSec=201.84240025384872, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:45,198] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=23, lr=[3.755848378377192e-05, 3.755848378377192e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:45,205] [INFO] [timer.py:199:stop] epoch=0/micro_step=2470/global_step=2470, RunningAvgSamplesPerSec=203.55708388144654, CurrSamplesPerSec=203.1631872124001, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:46,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=23, lr=[3.746609918611222e-05, 3.746609918611222e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:46,806] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=2480, RunningAvgSamplesPerSec=203.544886518137, CurrSamplesPerSec=201.54204044123168, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:48,375] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=23, lr=[3.737348745869602e-05, 3.737348745869602e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:48,380] [INFO] [timer.py:199:stop] epoch=0/micro_step=2490/global_step=2490, RunningAvgSamplesPerSec=203.54744908909993, CurrSamplesPerSec=204.41572239686136, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:49,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=23, lr=[3.728065028888987e-05, 3.728065028888987e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:49,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=203.55322652058717, CurrSamplesPerSec=206.3018710737501, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:51,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=23, lr=[3.718758936816788e-05, 3.718758936816788e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:51,510] [INFO] [timer.py:199:stop] epoch=0/micro_step=2510/global_step=2510, RunningAvgSamplesPerSec=203.5616082437123, CurrSamplesPerSec=206.49198910752474, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:53,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=23, lr=[3.7094306392080816e-05, 3.7094306392080816e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:53,089] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=2520, RunningAvgSamplesPerSec=203.56121547797483, CurrSamplesPerSec=207.74937157150464, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:54,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=23, lr=[3.7000803060225284e-05, 3.7000803060225284e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:54,647] [INFO] [timer.py:199:stop] epoch=0/micro_step=2530/global_step=2530, RunningAvgSamplesPerSec=203.5716982574814, CurrSamplesPerSec=199.93077529229683, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:56,020] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2538
[2023-05-15 22:56:56,020] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 22:56:56,020] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 22:56:56,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=24, lr=[3.691646306532564e-05, 3.691646306532564e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:56,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=2540/global_step=2540, RunningAvgSamplesPerSec=203.59620074267585, CurrSamplesPerSec=206.2948951062924, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:57,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=24, lr=[3.682254575425273e-05, 3.682254575425273e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:57,730] [INFO] [timer.py:199:stop] epoch=0/micro_step=2550/global_step=2550, RunningAvgSamplesPerSec=203.60899699515437, CurrSamplesPerSec=206.3770511968903, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:56:59,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=24, lr=[3.672841303883413e-05, 3.672841303883413e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:56:59,289] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=2560, RunningAvgSamplesPerSec=203.6182501730329, CurrSamplesPerSec=208.02660900967925, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:00,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=24, lr=[3.66340666341485e-05, 3.66340666341485e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:00,939] [INFO] [timer.py:199:stop] epoch=0/micro_step=2570/global_step=2570, RunningAvgSamplesPerSec=203.58296959591394, CurrSamplesPerSec=196.03072397641515, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:02,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=24, lr=[3.6539508259167863e-05, 3.6539508259167863e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:02,563] [INFO] [timer.py:199:stop] epoch=0/micro_step=2580/global_step=2580, RunningAvgSamplesPerSec=203.56075215011145, CurrSamplesPerSec=207.58293404931223, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:04,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=24, lr=[3.6444739636726335e-05, 3.6444739636726335e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:04,116] [INFO] [timer.py:199:stop] epoch=0/micro_step=2590/global_step=2590, RunningAvgSamplesPerSec=203.5731982851234, CurrSamplesPerSec=207.22877845352406, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:05,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=24, lr=[3.634976249348867e-05, 3.634976249348867e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:05,674] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=2600, RunningAvgSamplesPerSec=203.5832859248294, CurrSamplesPerSec=207.16960349703024, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:07,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=24, lr=[3.625457855991883e-05, 3.625457855991883e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:07,230] [INFO] [timer.py:199:stop] epoch=0/micro_step=2610/global_step=2610, RunningAvgSamplesPerSec=203.59408975337783, CurrSamplesPerSec=207.06509038231016, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
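Note: the lr column traces the cosine schedule requested on the command line (--lr_scheduler_type cosine, --num_warmup_steps 0, peak 5e-5). The scheduler appears to advance only on applied (non-skipped) steps; under that assumption the logged values back-solve to a total of roughly 7360 scheduler steps. A sketch of the standard half-cycle cosine decay (total_steps is an estimate, since the log never prints it):

    import math

    def cosine_lr(applied_steps, max_lr=5e-5, total_steps=7360):
        # Zero-warmup, half-cycle cosine decay:
        # lr = max_lr * 0.5 * (1 + cos(pi * progress)).
        # total_steps is back-solved from the logged lr values -- an estimate.
        progress = min(applied_steps / total_steps, 1.0)
        return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

    # step=2540 with skipped=24 -> 2516 applied updates
    print(cosine_lr(2540 - 24))   # ~3.69e-05, matching the log's 3.691646e-05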
[2023-05-15 22:57:08,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=24, lr=[3.615918957024845e-05, 3.615918957024845e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:08,789] [INFO] [timer.py:199:stop] epoch=0/micro_step=2620/global_step=2620, RunningAvgSamplesPerSec=203.6036439811771, CurrSamplesPerSec=206.53488547445198, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:10,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=24, lr=[3.606359726244526e-05, 3.606359726244526e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:10,347] [INFO] [timer.py:199:stop] epoch=0/micro_step=2630/global_step=2630, RunningAvgSamplesPerSec=203.61344937795928, CurrSamplesPerSec=207.60444666492396, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:11,905] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:57:11,905] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 22:57:11,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=24, lr=[3.5967803378181386e-05, 3.5967803378181386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:11,930] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=2640, RunningAvgSamplesPerSec=203.61114583439343, CurrSamplesPerSec=200.48206131670338, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:13,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=24, lr=[3.587180966280166e-05, 3.587180966280166e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:13,532] [INFO] [timer.py:199:stop] epoch=0/micro_step=2650/global_step=2650, RunningAvgSamplesPerSec=203.59909178839763, CurrSamplesPerSec=200.6454008640665, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:15,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=24, lr=[3.577561786529177e-05, 3.577561786529177e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:15,134] [INFO] [timer.py:199:stop] epoch=0/micro_step=2660/global_step=2660, RunningAvgSamplesPerSec=203.58751511112078, CurrSamplesPerSec=199.26854645224125, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:16,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=24, lr=[3.5679229738246434e-05, 3.5679229738246434e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:16,735] [INFO] [timer.py:199:stop] epoch=0/micro_step=2670/global_step=2670, RunningAvgSamplesPerSec=203.5764538046713, CurrSamplesPerSec=201.04814032564897, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:18,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=24, lr=[3.5582647037837445e-05, 3.5582647037837445e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:18,337] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=2680, RunningAvgSamplesPerSec=203.56497248380788, CurrSamplesPerSec=200.91391481896184, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:19,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=24, lr=[3.54858715237817e-05, 3.54858715237817e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:19,941] [INFO] [timer.py:199:stop] epoch=0/micro_step=2690/global_step=2690, RunningAvgSamplesPerSec=203.5526100618762, CurrSamplesPerSec=199.83819710794006, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:21,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=24, lr=[3.53889049593091e-05, 3.53889049593091e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:21,599] [INFO] [timer.py:199:stop] epoch=0/micro_step=2700/global_step=2700, RunningAvgSamplesPerSec=203.5146377633373, CurrSamplesPerSec=170.3298760004924, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:23,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=24, lr=[3.529174911113046e-05, 3.529174911113046e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:23,280] [INFO] [timer.py:199:stop] epoch=0/micro_step=2710/global_step=2710, RunningAvgSamplesPerSec=203.46610097750204, CurrSamplesPerSec=200.89677335341037, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:24,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=24, lr=[3.519440574940529e-05, 3.519440574940529e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:24,880] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=2720, RunningAvgSamplesPerSec=203.4559931901487, CurrSamplesPerSec=199.78049078183935, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:26,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=24, lr=[3.5096876647709575e-05, 3.5096876647709575e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:26,517] [INFO] [timer.py:199:stop] epoch=0/micro_step=2730/global_step=2730, RunningAvgSamplesPerSec=203.4289995773447, CurrSamplesPerSec=171.41800673065256, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:28,362] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:57:28,362] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:57:28,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=24, lr=[3.499916358300343e-05, 3.499916358300343e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:28,381] [INFO] [timer.py:199:stop] epoch=0/micro_step=2740/global_step=2740, RunningAvgSamplesPerSec=203.2955322846925, CurrSamplesPerSec=194.23272722010375, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:29,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=24, lr=[3.490126833559875e-05, 3.490126833559875e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:29,969] [INFO] [timer.py:199:stop] epoch=0/micro_step=2750/global_step=2750, RunningAvgSamplesPerSec=203.2919146989093, CurrSamplesPerSec=202.28501841726097, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:30,092] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2750
[2023-05-15 22:57:30,092] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:57:30,092] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:57:31,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=25, lr=[3.4813008320830434e-05, 3.4813008320830434e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:31,529] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=2760, RunningAvgSamplesPerSec=203.3011792127639, CurrSamplesPerSec=202.52065377465536, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:33,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=25, lr=[3.471477184291979e-05, 3.471477184291979e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:33,114] [INFO] [timer.py:199:stop] epoch=0/micro_step=2770/global_step=2770, RunningAvgSamplesPerSec=203.2986798638339, CurrSamplesPerSec=202.98651422003434, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:34,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=25, lr=[3.4616358363869465e-05, 3.4616358363869465e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:34,700] [INFO] [timer.py:199:stop] epoch=0/micro_step=2780/global_step=2780, RunningAvgSamplesPerSec=203.29585317690825, CurrSamplesPerSec=202.10773840895058, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:36,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=25, lr=[3.4517769676752754e-05, 3.4517769676752754e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:36,294] [INFO] [timer.py:199:stop] epoch=0/micro_step=2790/global_step=2790, RunningAvgSamplesPerSec=203.2894157300352, CurrSamplesPerSec=202.36248556360007, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:37,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=25, lr=[3.4419007577835214e-05, 3.4419007577835214e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:37,878] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=2800, RunningAvgSamplesPerSec=203.28739981572264, CurrSamplesPerSec=201.98334083271757, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:39,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=25, lr=[3.432007386654192e-05, 3.432007386654192e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:39,450] [INFO] [timer.py:199:stop] epoch=0/micro_step=2810/global_step=2810, RunningAvgSamplesPerSec=203.2910831103266, CurrSamplesPerSec=206.3304140359939, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:41,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=25, lr=[3.422097034542468e-05, 3.422097034542468e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:41,017] [INFO] [timer.py:199:stop] epoch=0/micro_step=2820/global_step=2820, RunningAvgSamplesPerSec=203.2973917605243, CurrSamplesPerSec=207.28222488540752, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:42,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=25, lr=[3.412169882012922e-05, 3.412169882012922e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:42,571] [INFO] [timer.py:199:stop] epoch=0/micro_step=2830/global_step=2830, RunningAvgSamplesPerSec=203.30904349373841, CurrSamplesPerSec=206.3770511968903, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
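Note: MemAllocated/MaxMemAllocated hold flat at 4.34 GB / 12.81 GB per GPU throughout, which is the expected steady state: the persistent allocation (fp16 weights plus the fp32 master copy and FusedAdam state, all resident on each rank at ZeRO stage 0) is constant, while the peak adds activations and temporaries. These columns appear to correspond to PyTorch's allocator counters; a sketch of reading the same counters inside a training loop (the helper name is made up):

    import torch

    GB = 1024 ** 3

    def log_gpu_memory(tag=""):
        # Allocator counters that the MemAllocated / MaxMemAllocated
        # columns appear to be based on (hypothetical helper).
        alloc = torch.cuda.memory_allocated() / GB
        peak = torch.cuda.max_memory_allocated() / GB
        print(f"{tag} MemAllocated={alloc:.2f}GB, MaxMemAllocated={peak:.2f}GB")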
[2023-05-15 22:57:44,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=25, lr=[3.4022261099362265e-05, 3.4022261099362265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:44,126] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=2840, RunningAvgSamplesPerSec=203.32090735291942, CurrSamplesPerSec=206.67481964537313, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:45,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=25, lr=[3.392265899485857e-05, 3.392265899485857e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:45,684] [INFO] [timer.py:199:stop] epoch=0/micro_step=2850/global_step=2850, RunningAvgSamplesPerSec=203.33053367327446, CurrSamplesPerSec=206.74071792091286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:45,977] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:57:45,977] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:57:47,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=25, lr=[3.382289432134795e-05, 3.382289432134795e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:47,241] [INFO] [timer.py:199:stop] epoch=0/micro_step=2860/global_step=2860, RunningAvgSamplesPerSec=203.34127490452795, CurrSamplesPerSec=206.35547492389531, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:48,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=25, lr=[3.372296889652218e-05, 3.372296889652218e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:48,798] [INFO] [timer.py:199:stop] epoch=0/micro_step=2870/global_step=2870, RunningAvgSamplesPerSec=203.35155906118578, CurrSamplesPerSec=205.66583256844535, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:50,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=25, lr=[3.362288454100189e-05, 3.362288454100189e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:50,356] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=2880, RunningAvgSamplesPerSec=203.3611805111285, CurrSamplesPerSec=206.29679759608365, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:51,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=25, lr=[3.3522643078303406e-05, 3.3522643078303406e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:51,910] [INFO] [timer.py:199:stop] epoch=0/micro_step=2890/global_step=2890, RunningAvgSamplesPerSec=203.37276026935567, CurrSamplesPerSec=206.95557673152064, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:53,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=25, lr=[3.34222463348055e-05, 3.34222463348055e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:53,462] [INFO] [timer.py:199:stop] epoch=0/micro_step=2900/global_step=2900, RunningAvgSamplesPerSec=203.38480325814442, CurrSamplesPerSec=206.39736239942917, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:55,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=25, lr=[3.332169613971615e-05, 3.332169613971615e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:55,027] [INFO] [timer.py:199:stop] epoch=0/micro_step=2910/global_step=2910, RunningAvgSamplesPerSec=203.39113211624445, CurrSamplesPerSec=199.38369755913502, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:56,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=25, lr=[3.322099432503916e-05, 3.322099432503916e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:56,581] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=2920, RunningAvgSamplesPerSec=203.40225657720688, CurrSamplesPerSec=206.95972528260458, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:58,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=25, lr=[3.312014272554084e-05, 3.312014272554084e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:58,134] [INFO] [timer.py:199:stop] epoch=0/micro_step=2930/global_step=2930, RunningAvgSamplesPerSec=203.41370561308648, CurrSamplesPerSec=206.99547353599348, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:57:59,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=25, lr=[3.301914317871651e-05, 3.301914317871651e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:57:59,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=2940/global_step=2940, RunningAvgSamplesPerSec=203.4186122215627, CurrSamplesPerSec=208.8605036576883, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:01,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=25, lr=[3.291799752475713e-05, 3.291799752475713e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:01,256] [INFO] [timer.py:199:stop] epoch=0/micro_step=2950/global_step=2950, RunningAvgSamplesPerSec=203.42979666936566, CurrSamplesPerSec=206.26255814775485, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:01,547] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:58:01,547] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:58:02,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=25, lr=[3.281670760651563e-05, 3.281670760651563e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:02,808] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=2960, RunningAvgSamplesPerSec=203.44141263070875, CurrSamplesPerSec=207.49853208768207, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:04,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=25, lr=[3.271527526947343e-05, 3.271527526947343e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:04,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2970/global_step=2970, RunningAvgSamplesPerSec=203.45242657854706, CurrSamplesPerSec=206.69869097853206, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:05,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=25, lr=[3.26137023617068e-05, 3.26137023617068e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:05,913] [INFO] [timer.py:199:stop] epoch=0/micro_step=2980/global_step=2980, RunningAvgSamplesPerSec=203.4640555551352, CurrSamplesPerSec=207.4000964236708, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:07,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=25, lr=[3.251199073385317e-05, 3.251199073385317e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:07,465] [INFO] [timer.py:199:stop] epoch=0/micro_step=2990/global_step=2990, RunningAvgSamplesPerSec=203.4757367462835, CurrSamplesPerSec=207.3395423993524, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:09,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=25, lr=[3.241014223907744e-05, 3.241014223907744e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:09,028] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=203.48238836944319, CurrSamplesPerSec=207.05870156476826, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:10,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=25, lr=[3.230815873303817e-05, 3.230815873303817e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:10,623] [INFO] [timer.py:199:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=203.47540534868023, CurrSamplesPerSec=201.02525514254134, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:12,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=25, lr=[3.220604207385382e-05, 3.220604207385382e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:12,220] [INFO] [timer.py:199:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=203.46774997804016, CurrSamplesPerSec=200.83905519351774, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:12,663] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3022
[2023-05-15 22:58:12,663] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:58:12,663] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:58:13,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=26, lr=[3.211402477233927e-05, 3.211402477233927e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:13,780] [INFO] [timer.py:199:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=203.47548612609262, CurrSamplesPerSec=199.9164811005572, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:15,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=26, lr=[3.201166024995706e-05, 3.201166024995706e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:15,378] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=203.46730713510104, CurrSamplesPerSec=201.4189425851753, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:16,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=26, lr=[3.1909167976570976e-05, 3.1909167976570976e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:16,975] [INFO] [timer.py:199:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=203.4592857762531, CurrSamplesPerSec=200.69700549376606, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:18,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=26, lr=[3.180654981956912e-05, 3.180654981956912e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:18,569] [INFO] [timer.py:199:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=203.45289843379936, CurrSamplesPerSec=201.3796555392346, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:20,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=26, lr=[3.1703807648633144e-05, 3.1703807648633144e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:20,170] [INFO] [timer.py:199:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=203.4435341720365, CurrSamplesPerSec=201.21240396075228, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:21,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=26, lr=[3.160094333570421e-05, 3.160094333570421e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:21,766] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=203.4363132168589, CurrSamplesPerSec=201.41017722354778, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:23,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=26, lr=[3.149795875494889e-05, 3.149795875494889e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:23,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=203.42954861500579, CurrSamplesPerSec=200.81471529072394, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:24,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=26, lr=[3.139485578272501e-05, 3.139485578272501e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:24,956] [INFO] [timer.py:199:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=203.4227752845993, CurrSamplesPerSec=200.9217347045703, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
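Note: skipped counts the optimizer steps abandoned because of overflow since the start of the run, so the number of parameter updates actually applied is global_step minus skipped. By step 3100 that is 3074 updates out of 3100 micro-steps, a skip rate under 1%, which is normal fp16 behavior rather than a sign of divergence. The arithmetic:

    global_step = 3100   # from the log line above
    skipped = 26         # cumulative overflow-skipped steps

    applied = global_step - skipped
    print(f"{applied} applied updates, {skipped / global_step:.2%} skip rate")
    # -> 3074 applied updates, 0.84% skip rate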
[2023-05-15 22:58:26,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=26, lr=[3.1291636297547464e-05, 3.1291636297547464e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:26,552] [INFO] [timer.py:199:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=203.4157879875305, CurrSamplesPerSec=201.30626006773315, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:28,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=26, lr=[3.118830218005399e-05, 3.118830218005399e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:28,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=203.40867800914904, CurrSamplesPerSec=201.4812288431371, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:28,769] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:58:28,769] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:58:29,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=26, lr=[3.1084855312970896e-05, 3.1084855312970896e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:29,769] [INFO] [timer.py:199:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=203.39162886596003, CurrSamplesPerSec=200.90218613179658, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:29,901] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3130
[2023-05-15 22:58:29,901] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:58:29,901] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:58:31,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=27, lr=[3.0991658289426254e-05, 3.0991658289426254e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:31,335] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=203.39731099257932, CurrSamplesPerSec=200.77506290220762, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:32,924] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=27, lr=[3.0888002392363576e-05, 3.0888002392363576e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:32,930] [INFO] [timer.py:199:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=203.39076784934346, CurrSamplesPerSec=201.84391796173594, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:34,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=27, lr=[3.078423921711149e-05, 3.078423921711149e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:34,524] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=203.3846271634234, CurrSamplesPerSec=201.5913822806256, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:36,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=27, lr=[3.0680370654213645e-05, 3.0680370654213645e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:36,120] [INFO] [timer.py:199:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=203.37768853702983, CurrSamplesPerSec=200.60461598706257, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:37,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=27, lr=[3.057639859613384e-05, 3.057639859613384e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:37,716] [INFO] [timer.py:199:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=203.37090574954445, CurrSamplesPerSec=201.46731692086922, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:39,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=27, lr=[3.047232493722154e-05, 3.047232493722154e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
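The lr column above traces the cosine decay requested by --lr_scheduler_type cosine. The logged values are consistent with a zero-warmup cosine schedule over 2 epochs x 3680 micro-batches = 7360 updates that advances only on non-skipped steps; a quick check of that reading (an inference from the numbers, not something the log states directly):

import math

# Reproduce the logged lr under the assumed schedule: zero warmup,
# cosine decay over 7360 total updates, stepped only when the
# optimizer step is not skipped for overflow.
def cosine_lr(effective_step, total_steps=2 * 3680, base_lr=5e-5):
    progress = effective_step / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# step=3190 with skipped=27 -> 3163 effective optimizer updates
print(cosine_lr(3190 - 27))  # ~3.0472e-05, matching lr=[3.047232...e-05] above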
[2023-05-15 22:58:39,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=203.3648009821887, CurrSamplesPerSec=200.96505586449976, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:40,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=27, lr=[3.036815157367734e-05, 3.036815157367734e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:40,867] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=203.3743490515482, CurrSamplesPerSec=206.9415367157073, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:42,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=27, lr=[3.026388040351843e-05, 3.026388040351843e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:42,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=203.3847154149866, CurrSamplesPerSec=206.9278176315599, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:43,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=27, lr=[3.0159513326544053e-05, 3.0159513326544053e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:43,973] [INFO] [timer.py:199:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=203.39543466825467, CurrSamplesPerSec=207.23485779575702, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:45,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=27, lr=[3.0055052244300817e-05, 3.0055052244300817e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:45,530] [INFO] [timer.py:199:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=203.40443312425325, CurrSamplesPerSec=206.6687731162828, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:45,823] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:58:45,824] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:58:47,089] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=27, lr=[2.9950499060048108e-05, 2.9950499060048108e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:47,095] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=203.41027163550515, CurrSamplesPerSec=205.03025859119128, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:48,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=27, lr=[2.9845855678723372e-05, 2.9845855678723372e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:48,656] [INFO] [timer.py:199:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=203.41754584862855, CurrSamplesPerSec=206.68818691260645, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:50,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=27, lr=[2.9741124006907433e-05, 2.9741124006907433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:50,214] [INFO] [timer.py:199:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=203.42599225955524, CurrSamplesPerSec=205.95017339266533, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:51,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=27, lr=[2.963630595278977e-05, 2.963630595278977e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:51,776] [INFO] [timer.py:199:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=203.43282246010048, CurrSamplesPerSec=207.09448402173126, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:53,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=27, lr=[2.9531403426133712e-05, 2.9531403426133712e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:53,330] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=203.44248720972078, CurrSamplesPerSec=206.52090247869285, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:54,227] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3285
[2023-05-15 22:58:54,227] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:58:54,227] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:58:54,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=28, lr=[2.9436920507783127e-05, 2.9436920507783127e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:54,850] [INFO] [timer.py:199:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=203.46562650271062, CurrSamplesPerSec=205.52348431138054, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:56,404] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=28, lr=[2.933186275018621e-05, 2.933186275018621e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:56,410] [INFO] [timer.py:199:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=203.47290804789932, CurrSamplesPerSec=207.24701755036116, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:57,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=28, lr=[2.9226726066943267e-05, 2.9226726066943267e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:57,963] [INFO] [timer.py:199:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=203.48296165740516, CurrSamplesPerSec=206.6118258101362, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:58:59,512] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=28, lr=[2.912151237362299e-05, 2.912151237362299e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:58:59,518] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=203.4921229745747, CurrSamplesPerSec=206.61532446020794, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:01,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=28, lr=[2.9016223587197166e-05, 2.9016223587197166e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:01,072] [INFO] [timer.py:199:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=203.50177650549273, CurrSamplesPerSec=205.58644595337694, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:02,618] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=28, lr=[2.8910861626005776e-05, 2.8910861626005776e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:02,624] [INFO] [timer.py:199:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=203.51189134922052, CurrSamplesPerSec=207.51008115324498, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:04,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=28, lr=[2.8805428409722024e-05, 2.8805428409722024e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:04,178] [INFO] [timer.py:199:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=203.52151389945, CurrSamplesPerSec=206.82450438789112, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:05,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=28, lr=[2.8699925859317366e-05, 2.8699925859317366e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:05,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=203.53083270622497, CurrSamplesPerSec=206.4910360573114, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:07,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=28, lr=[2.859435589702653e-05, 2.859435589702653e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:07,285] [INFO] [timer.py:199:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=203.54045160414015, CurrSamplesPerSec=206.97759786572908, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:08,835] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=28, lr=[2.8488720446312456e-05, 2.8488720446312456e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:08,841] [INFO] [timer.py:199:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=203.5488146723659, CurrSamplesPerSec=206.64777213240956, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:09,914] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:59:09,915] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:59:10,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=28, lr=[2.8383021431831247e-05, 2.8383021431831247e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:10,400] [INFO] [timer.py:199:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=203.55647170693632, CurrSamplesPerSec=206.7416732773161, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:11,148] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3394
[2023-05-15 22:59:11,148] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:59:11,148] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:59:11,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=29, lr=[2.8287839563439395e-05, 2.8287839563439395e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:11,926] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=203.5762519314513, CurrSamplesPerSec=207.0871348295537, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:13,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=29, lr=[2.8182025084347836e-05, 2.8182025084347836e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:13,478] [INFO] [timer.py:199:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=203.58615395171537, CurrSamplesPerSec=206.8413973357662, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:15,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=29, lr=[2.8076152629415403e-05, 2.8076152629415403e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:15,036] [INFO] [timer.py:199:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=203.59354115479022, CurrSamplesPerSec=206.73371224373489, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:16,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=29, lr=[2.797022412761641e-05, 2.797022412761641e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:16,594] [INFO] [timer.py:199:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=203.60098111239319, CurrSamplesPerSec=205.30530618113735, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:18,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=29, lr=[2.7864241508946305e-05, 2.7864241508946305e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:18,148] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=203.61025189389255, CurrSamplesPerSec=206.37958987858732, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:19,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=29, lr=[2.7758206704386545e-05, 2.7758206704386545e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:19,701] [INFO] [timer.py:199:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=203.6195518919642, CurrSamplesPerSec=207.35363647458863, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:21,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=29, lr=[2.7652121645869412e-05, 2.7652121645869412e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:21,255] [INFO] [timer.py:199:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=203.62807574327547, CurrSamplesPerSec=206.1291961408865, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:22,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=29, lr=[2.7545988266242785e-05, 2.7545988266242785e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:22,807] [INFO] [timer.py:199:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=203.63779625586042, CurrSamplesPerSec=207.12037011317588, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:24,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=29, lr=[2.7439808499234957e-05, 2.7439808499234957e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:24,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=203.64721498850875, CurrSamplesPerSec=206.6433179269423, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:25,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=29, lr=[2.73335842794194e-05, 2.73335842794194e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:25,914] [INFO] [timer.py:199:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=203.65542087301984, CurrSamplesPerSec=206.95844878763347, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:26,827] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:59:26,827] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 22:59:27,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=29, lr=[2.7227317542179477e-05, 2.7227317542179477e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:27,465] [INFO] [timer.py:199:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=203.66502632203478, CurrSamplesPerSec=206.67768391422314, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:29,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=29, lr=[2.7121010223673237e-05, 2.7121010223673237e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:29,022] [INFO] [timer.py:199:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=203.67269580900117, CurrSamplesPerSec=206.40878920787634, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:29,610] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3513
[2023-05-15 22:59:29,610] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 22:59:29,610] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 22:59:30,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=30, lr=[2.7025300540865923e-05, 2.7025300540865923e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:30,544] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=203.69300427803483, CurrSamplesPerSec=207.339222101728, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:32,090] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=30, lr=[2.6918921454688734e-05, 2.6918921454688734e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:32,097] [INFO] [timer.py:199:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=203.70180896262545, CurrSamplesPerSec=206.6375913734812, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:32,681] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3533
[2023-05-15 22:59:32,682] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 22:59:32,682] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 22:59:33,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=31, lr=[2.6823150329065118e-05, 2.6823150329065118e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:33,613] [INFO] [timer.py:199:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=203.72384188706042, CurrSamplesPerSec=207.02101556462844, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:35,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=31, lr=[2.6716706472109342e-05, 2.6716706472109342e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:35,181] [INFO] [timer.py:199:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=203.727108490635, CurrSamplesPerSec=197.49489479856564, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:36,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=31, lr=[2.661023133711566e-05, 2.661023133711566e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:36,801] [INFO] [timer.py:199:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=203.7113357460525, CurrSamplesPerSec=198.18106218106217, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:38,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=31, lr=[2.6503726864039064e-05, 2.6503726864039064e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:38,429] [INFO] [timer.py:199:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=203.69275171403754, CurrSamplesPerSec=196.69664795217767, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:40,061] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=31, lr=[2.6397194993369086e-05, 2.6397194993369086e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:40,067] [INFO] [timer.py:199:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=203.67060886047432, CurrSamplesPerSec=196.7973020923447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:41,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=31, lr=[2.6290637666094458e-05, 2.6290637666094458e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:41,692] [INFO] [timer.py:199:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=203.6534474906326, CurrSamplesPerSec=198.1111489962169, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:43,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=31, lr=[2.6184056823667684e-05, 2.6184056823667684e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:43,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=203.63817314894473, CurrSamplesPerSec=198.16116554163787, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:44,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=31, lr=[2.607745440796976e-05, 2.607745440796976e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:44,938] [INFO] [timer.py:199:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=203.62055980940633, CurrSamplesPerSec=194.0847092439794, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:46,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=31, lr=[2.5970832361274707e-05, 2.5970832361274707e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:46,569] [INFO] [timer.py:199:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=203.60161116397472, CurrSamplesPerSec=198.0240459732806, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:48,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=31, lr=[2.5864192626214216e-05, 2.5864192626214216e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:48,197] [INFO] [timer.py:199:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=203.5835601713897, CurrSamplesPerSec=197.26703371321574, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:48,993] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 22:59:48,993] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 22:59:49,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=31, lr=[2.57575371457423e-05, 2.57575371457423e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:49,830] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=203.56400830358592, CurrSamplesPerSec=196.29993945049128, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:51,449] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=31, lr=[2.5650867863099785e-05, 2.5650867863099785e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:51,456] [INFO] [timer.py:199:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=203.54707261960843, CurrSamplesPerSec=198.0924358461577, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:53,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=31, lr=[2.5544186721779028e-05, 2.5544186721779028e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:53,077] [INFO] [timer.py:199:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=203.5319666305217, CurrSamplesPerSec=198.3225019024329, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:54,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=31, lr=[2.543749566548842e-05, 2.543749566548842e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:54,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=203.51554090745634, CurrSamplesPerSec=195.43130885280084, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 22:59:56,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=31, lr=[2.5330796638116998e-05, 2.5330796638116998e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 22:59:56,341] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=203.49430227434016, CurrSamplesPerSec=196.6594694705255, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
Epoch 1/2 with loss 0.5955116769541865
***** Evaluating reward, Epoch 1/2 *****
chosen_last_scores (higher is better) : 2.779517889022827, acc (higher is better) : 0.6574999690055847
Beginning of Epoch 2/2, Total Micro Batches 3680
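At the epoch boundary the trainer prints the epoch's average training loss (for a reward model this is typically the pairwise ranking loss, -log sigmoid(r_chosen - r_rejected)) and then evaluates: chosen_last_scores is the mean score assigned to the human-preferred responses, and acc is the fraction of pairs where the chosen response outscores the rejected one (about 0.657 here). A sketch of that accuracy metric on toy scores (hypothetical tensors illustrating the metric's definition, not DeepSpeed-Chat's exact eval loop):

import torch

# Pairwise reward accuracy: for each prompt we have the reward model's
# scalar score for the chosen and the rejected response.
# acc = fraction of pairs where chosen outscores rejected;
# the first return value is the mean score of the chosen responses.
def eval_reward(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor):
    acc = (chosen_scores > rejected_scores).float().mean().item()
    return chosen_scores.mean().item(), acc

chosen = torch.tensor([2.9, 3.1, 1.8, 2.4])
rejected = torch.tensor([2.1, 3.4, 0.9, 1.7])
print(eval_reward(chosen, rejected))  # -> approximately (2.55, 0.75) on this toy batch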
[2023-05-15 23:00:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=3690, skipped=31, lr=[2.5224091583699054e-05, 2.5224091583699054e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:01,699] [INFO] [timer.py:199:stop] epoch=1/micro_step=10/global_step=3690, RunningAvgSamplesPerSec=203.49259864323267, CurrSamplesPerSec=207.02037693708922, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:03,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=3700, skipped=31, lr=[2.5117382446378657e-05, 2.5117382446378657e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:03,249] [INFO] [timer.py:199:stop] epoch=1/micro_step=20/global_step=3700, RunningAvgSamplesPerSec=203.50233156482696, CurrSamplesPerSec=207.549871110307, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:04,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=3710, skipped=31, lr=[2.5010671170374296e-05, 2.5010671170374296e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:04,801] [INFO] [timer.py:199:stop] epoch=1/micro_step=30/global_step=3710, RunningAvgSamplesPerSec=203.5114889579799, CurrSamplesPerSec=205.34582584934932, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:06,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=3720, skipped=31, lr=[2.49039596999434e-05, 2.49039596999434e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:06,353] [INFO] [timer.py:199:stop] epoch=1/micro_step=40/global_step=3720, RunningAvgSamplesPerSec=203.52060164406424, CurrSamplesPerSec=206.070647409334, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:07,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=3730, skipped=31, lr=[2.4797249979346986e-05, 2.4797249979346986e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:07,911] [INFO] [timer.py:199:stop] epoch=1/micro_step=50/global_step=3730, RunningAvgSamplesPerSec=203.52773514728125, CurrSamplesPerSec=201.14154178156096, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:08,668] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:00:08,669] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 23:00:09,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=3740, skipped=31, lr=[2.4690543952814126e-05, 2.4690543952814126e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:09,466] [INFO] [timer.py:199:stop] epoch=1/micro_step=60/global_step=3740, RunningAvgSamplesPerSec=203.53585816930448, CurrSamplesPerSec=205.72383613547777, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:09,593] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3740
[2023-05-15 23:00:09,593] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 23:00:09,593] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:00:10,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=3750, skipped=32, lr=[2.4594513294210588e-05, 2.4594513294210588e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:10,990] [INFO] [timer.py:199:stop] epoch=1/micro_step=70/global_step=3750, RunningAvgSamplesPerSec=203.55434778058674, CurrSamplesPerSec=207.13539340495177, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:12,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=3760, skipped=32, lr=[2.44878196424801e-05, 2.44878196424801e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:12,542] [INFO] [timer.py:199:stop] epoch=1/micro_step=80/global_step=3760, RunningAvgSamplesPerSec=203.56318146320768, CurrSamplesPerSec=207.2322980292372, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:14,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=3770, skipped=32, lr=[2.4381135322570097e-05, 2.4381135322570097e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:14,094] [INFO] [timer.py:199:stop] epoch=1/micro_step=90/global_step=3770, RunningAvgSamplesPerSec=203.57214773777474, CurrSamplesPerSec=206.70919611216266, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:15,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=3780, skipped=32, lr=[2.42744622782469e-05, 2.42744622782469e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:15,646] [INFO] [timer.py:199:stop] epoch=1/micro_step=100/global_step=3780, RunningAvgSamplesPerSec=203.58081493587738, CurrSamplesPerSec=207.34915178826608, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:17,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=3790, skipped=32, lr=[2.416780245307139e-05, 2.416780245307139e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:17,198] [INFO] [timer.py:199:stop] epoch=1/micro_step=110/global_step=3790, RunningAvgSamplesPerSec=203.58967344600916, CurrSamplesPerSec=206.50501500886992, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:18,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=3800, skipped=32, lr=[2.4061157790363568e-05, 2.4061157790363568e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:18,750] [INFO] [timer.py:199:stop] epoch=1/micro_step=120/global_step=3800, RunningAvgSamplesPerSec=203.59835661528984, CurrSamplesPerSec=206.4589551664836, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:20,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=3810, skipped=32, lr=[2.3954530233167202e-05, 2.3954530233167202e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:20,313] [INFO] [timer.py:199:stop] epoch=1/micro_step=130/global_step=3810, RunningAvgSamplesPerSec=203.60350445320674, CurrSamplesPerSec=206.45577938299104, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
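A quick consistency check on the throughput numbers: each micro-step processes 8 GPUs x per-device batch 4 = 32 samples, and the timestamps above show about 1.56 s per 10 logged steps in this stretch, which lands right at the reported CurrSamplesPerSec of roughly 206 (back-of-the-envelope arithmetic below; timestamps taken from the step 3800 and 3810 entries):

gpus, per_device_batch = 8, 4
samples_per_step = gpus * per_device_batch   # 32 samples per micro-step
# step 3800 logged at 23:00:18,744 and step 3810 at 23:00:20,306
elapsed = 20.306 - 18.744                    # 1.562 s for 10 micro-steps
print(samples_per_step * 10 / elapsed)       # ~204.9 samples/s, close to CurrSamplesPerSec ~206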
CurrSamplesPerSec=206.45577938299104, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:21,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=3820, skipped=32, lr=[2.38479217242144e-05, 2.38479217242144e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:21,865] [INFO] [timer.py:199:stop] epoch=1/micro_step=140/global_step=3820, RunningAvgSamplesPerSec=203.61222770302768, CurrSamplesPerSec=206.59179036935333, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:23,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=3830, skipped=32, lr=[2.3741334205890222e-05, 2.3741334205890222e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:23,416] [INFO] [timer.py:199:stop] epoch=1/micro_step=150/global_step=3830, RunningAvgSamplesPerSec=203.62111407752693, CurrSamplesPerSec=206.82163604816367, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:24,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=3840, skipped=32, lr=[2.3634769620197254e-05, 2.3634769620197254e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:24,970] [INFO] [timer.py:199:stop] epoch=1/micro_step=160/global_step=3840, RunningAvgSamplesPerSec=203.62897771070746, CurrSamplesPerSec=206.1000758569222, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,263] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,262] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:00:25,263] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:363:_update_scale] Reducing 
dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,874] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,875] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3845 [2023-05-15 23:00:25,875] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:00:26,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=3850, skipped=33, lr=[2.3538882705202288e-05, 2.3538882705202288e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:26,497] [INFO] [timer.py:199:stop] epoch=1/micro_step=170/global_step=3850, RunningAvgSamplesPerSec=203.64583294364652, CurrSamplesPerSec=207.03634380770066, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:28,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=3860, skipped=33, lr=[2.3432367040200857e-05, 2.3432367040200857e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:28,048] [INFO] [timer.py:199:stop] epoch=1/micro_step=180/global_step=3860, RunningAvgSamplesPerSec=203.65471345650315, CurrSamplesPerSec=206.15452481668189, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:29,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=3870, skipped=33, lr=[2.3325879937148676e-05, 2.3325879937148676e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:29,619] [INFO] [timer.py:199:stop] epoch=1/micro_step=190/global_step=3870, RunningAvgSamplesPerSec=203.65667991042955, CurrSamplesPerSec=207.11461708917406, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:31,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=3880, skipped=33, lr=[2.3219423336218818e-05, 2.3219423336218818e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:31,173] [INFO] [timer.py:199:stop] epoch=1/micro_step=200/global_step=3880, RunningAvgSamplesPerSec=203.66451000598, CurrSamplesPerSec=206.44402983352842, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:00:32,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=3890, skipped=33, lr=[2.3112999177028606e-05, 2.3112999177028606e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:00:32,725] [INFO] [timer.py:199:stop] epoch=1/micro_step=210/global_step=3890, RunningAvgSamplesPerSec=203.67275448463144, CurrSamplesPerSec=206.76556005730748, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 
[2023-05-15 23:00:34,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=3900, skipped=33, lr=[2.3006609398604268e-05, 2.3006609398604268e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:34,281] [INFO] [timer.py:199:stop] epoch=1/micro_step=220/global_step=3900, RunningAvgSamplesPerSec=203.67950289492546, CurrSamplesPerSec=207.47864504350756, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:35,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=3910, skipped=33, lr=[2.2900255939345615e-05, 2.2900255939345615e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:35,834] [INFO] [timer.py:199:stop] epoch=1/micro_step=230/global_step=3910, RunningAvgSamplesPerSec=203.6874469378029, CurrSamplesPerSec=206.97312943055982, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:37,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=3920, skipped=33, lr=[2.279394073699077e-05, 2.279394073699077e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:37,388] [INFO] [timer.py:199:stop] epoch=1/micro_step=240/global_step=3920, RunningAvgSamplesPerSec=203.69520980967232, CurrSamplesPerSec=206.64268162755036, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:38,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=3930, skipped=33, lr=[2.2687665728580767e-05, 2.2687665728580767e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:38,940] [INFO] [timer.py:199:stop] epoch=1/micro_step=250/global_step=3930, RunningAvgSamplesPerSec=203.70328005420689, CurrSamplesPerSec=206.25241530860782, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:40,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=3940, skipped=33, lr=[2.2581432850424367e-05, 2.2581432850424367e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:40,496] [INFO] [timer.py:199:stop] epoch=1/micro_step=260/global_step=3940, RunningAvgSamplesPerSec=203.70995591094996, CurrSamplesPerSec=205.94448519144782, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:41,564] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:00:41,564] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:00:42,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=3950, skipped=33, lr=[2.2475244038062694e-05, 2.2475244038062694e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:42,050] [INFO] [timer.py:199:stop] epoch=1/micro_step=270/global_step=3950, RunningAvgSamplesPerSec=203.717574866011, CurrSamplesPerSec=204.53783602560196, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:42,953] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3955 (x8, once per rank)
[2023-05-15 23:00:42,953] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:00:42,953] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:00:45,807] [INFO] [logging.py:96:log_dist] [Rank 0] step=3960, skipped=34, lr=[2.2379713382270012e-05, 2.2379713382270012e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:45,813] [INFO] [timer.py:199:stop] epoch=1/micro_step=280/global_step=3960, RunningAvgSamplesPerSec=203.00357664521445, CurrSamplesPerSec=13.390519327888708, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:49,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=3970, skipped=34, lr=[2.2273613624430255e-05, 2.2273613624430255e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:49,868] [INFO] [timer.py:199:stop] epoch=1/micro_step=290/global_step=3970, RunningAvgSamplesPerSec=202.20388648077878, CurrSamplesPerSec=196.68454178701379, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:51,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=3980, skipped=34, lr=[2.2167563540788053e-05, 2.2167563540788053e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:51,502] [INFO] [timer.py:199:stop] epoch=1/micro_step=300/global_step=3980, RunningAvgSamplesPerSec=202.18910564162957, CurrSamplesPerSec=199.37125838340475, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:53,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=3990, skipped=34, lr=[2.2061565063554064e-05, 2.2061565063554064e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:53,101] [INFO] [timer.py:199:stop] epoch=1/micro_step=310/global_step=3990, RunningAvgSamplesPerSec=202.18541722239752, CurrSamplesPerSec=202.26246040429066, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:54,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=34, lr=[2.195562012399867e-05, 2.195562012399867e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:54,729] [INFO] [timer.py:199:stop] epoch=1/micro_step=320/global_step=4000, RunningAvgSamplesPerSec=202.17268486314182, CurrSamplesPerSec=200.4715807082792, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:56,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=4010, skipped=34, lr=[2.184973065241682e-05, 2.184973065241682e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:56,340] [INFO] [timer.py:199:stop] epoch=1/micro_step=330/global_step=4010, RunningAvgSamplesPerSec=202.16544938644043, CurrSamplesPerSec=200.12812472508384, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:00:57,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=4020, skipped=34, lr=[2.174389857809288e-05, 2.174389857809288e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:57,935] [INFO] [timer.py:199:stop] epoch=1/micro_step=340/global_step=4020, RunningAvgSamplesPerSec=202.16312656492767, CurrSamplesPerSec=201.26248989696782, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
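The timer record at global_step=3960 shows CurrSamplesPerSec collapsing to ~13.4 while RunningAvgSamplesPerSec barely moves: the wall-clock cost of the skipped step at iteration 3955 (note the ~3 s gap between 23:00:42 and 23:00:45) is charged to a single step's instantaneous rate. A back-of-the-envelope check, assuming this run's 4 samples per GPU across 8 GPUs:

    # Hedged estimate: 4 samples/GPU x 8 GPUs = 32 samples per micro step
    samples = 4 * 8
    print(samples / 0.155)  # ~206/s, the steady-state CurrSamplesPerSec
    print(samples / 2.4)    # ~13/s, the one-step stall logged at global_step=3960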
[2023-05-15 23:00:59,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=4030, skipped=34, lr=[2.1638125829265385e-05, 2.1638125829265385e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:00:59,540] [INFO] [timer.py:199:stop] epoch=1/micro_step=350/global_step=4030, RunningAvgSamplesPerSec=202.15786481389173, CurrSamplesPerSec=201.19551312320024, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:01,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=4040, skipped=34, lr=[2.1532414333092026e-05, 2.1532414333092026e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:01,140] [INFO] [timer.py:199:stop] epoch=1/micro_step=360/global_step=4040, RunningAvgSamplesPerSec=202.15399779978512, CurrSamplesPerSec=200.20872657516756, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:02,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=4050, skipped=34, lr=[2.1426766015614466e-05, 2.1426766015614466e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:02,736] [INFO] [timer.py:199:stop] epoch=1/micro_step=370/global_step=4050, RunningAvgSamplesPerSec=202.15159212352353, CurrSamplesPerSec=201.92165511763167, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:03,834] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:01:03,834] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:01:04,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=4060, skipped=34, lr=[2.1321182801723257e-05, 2.1321182801723257e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:04,331] [INFO] [timer.py:199:stop] epoch=1/micro_step=380/global_step=4060, RunningAvgSamplesPerSec=202.1492751140099, CurrSamplesPerSec=201.45340691965592, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:05,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=4070, skipped=34, lr=[2.1215666615122778e-05, 2.1215666615122778e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:05,930] [INFO] [timer.py:199:stop] epoch=1/micro_step=390/global_step=4070, RunningAvgSamplesPerSec=202.1460318613045, CurrSamplesPerSec=201.16777004554905, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:07,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=4080, skipped=34, lr=[2.111021937829621e-05, 2.111021937829621e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:07,525] [INFO] [timer.py:199:stop] epoch=1/micro_step=400/global_step=4080, RunningAvgSamplesPerSec=202.1437865612696, CurrSamplesPerSec=200.85167820687982, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:09,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=4090, skipped=34, lr=[2.1004843012470436e-05, 2.1004843012470436e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:09,126] [INFO] [timer.py:199:stop] epoch=1/micro_step=410/global_step=4090, RunningAvgSamplesPerSec=202.13999682017908, CurrSamplesPerSec=200.79548793441347, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:09,258] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4090 (x8, once per rank)
[2023-05-15 23:01:09,259] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:01:09,259] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:01:10,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=4100, skipped=35, lr=[2.0910066464787004e-05, 2.0910066464787004e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:10,696] [INFO] [timer.py:199:stop] epoch=1/micro_step=420/global_step=4100, RunningAvgSamplesPerSec=202.14568750965844, CurrSamplesPerSec=201.6634708280182, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:12,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=4110, skipped=35, lr=[2.08048300421901e-05, 2.08048300421901e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:12,291] [INFO] [timer.py:199:stop] epoch=1/micro_step=430/global_step=4110, RunningAvgSamplesPerSec=202.14357284124628, CurrSamplesPerSec=199.25641745089393, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:13,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=4120, skipped=35, lr=[2.0699670054724724e-05, 2.0699670054724724e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:13,895] [INFO] [timer.py:199:stop] epoch=1/micro_step=440/global_step=4120, RunningAvgSamplesPerSec=202.1387137097595, CurrSamplesPerSec=200.00495922941204, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:15,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=35, lr=[2.059458841838417e-05, 2.059458841838417e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:15,495] [INFO] [timer.py:199:stop] epoch=1/micro_step=450/global_step=4130, RunningAvgSamplesPerSec=202.13500790072962, CurrSamplesPerSec=202.97546317510296, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:17,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=4140, skipped=35, lr=[2.0489587047734195e-05, 2.0489587047734195e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:17,047] [INFO] [timer.py:199:stop] epoch=1/micro_step=460/global_step=4140, RunningAvgSamplesPerSec=202.14644784666405, CurrSamplesPerSec=207.20862318639212, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:18,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=4150, skipped=35, lr=[2.0384667855878104e-05, 2.0384667855878104e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:18,615] [INFO] [timer.py:199:stop] epoch=1/micro_step=470/global_step=4150, RunningAvgSamplesPerSec=202.15254279485697, CurrSamplesPerSec=206.20520084683528, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:20,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=4160, skipped=35, lr=[2.0279832754421945e-05, 2.0279832754421945e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:20,170] [INFO] [timer.py:199:stop] epoch=1/micro_step=480/global_step=4160, RunningAvgSamplesPerSec=202.1629676188559, CurrSamplesPerSec=204.0915218928433, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
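The lr values decay smoothly across these records (2.38e-05 at step 3820 down to 2.03e-05 at 4160), consistent with a cosine schedule whose step counter advances only on applied (non-skipped) updates. A quick reproduction, assuming a 5e-5 peak, no warmup, and 7360 total steps (2 epochs x 3680 micro steps, inferred from global_step=3820 coinciding with epoch=1/micro_step=140):

    import math

    peak_lr, total_steps = 5e-5, 7360   # assumed from the log, not logged directly
    applied = 3820 - 32                 # global_step minus skipped steps
    lr = 0.5 * peak_lr * (1 + math.cos(math.pi * applied / total_steps))
    print(lr)                           # ~2.3848e-05, matching the step=3820 record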
[2023-05-15 23:01:21,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=4170, skipped=35, lr=[2.017508365343964e-05, 2.017508365343964e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:21,722] [INFO] [timer.py:199:stop] epoch=1/micro_step=490/global_step=4170, RunningAvgSamplesPerSec=202.17403417431746, CurrSamplesPerSec=206.38308066790398, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:23,277] [INFO] [logging.py:96:log_dist] [Rank 0] step=4180, skipped=35, lr=[2.007042246143823e-05, 2.007042246143823e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:23,283] [INFO] [timer.py:199:stop] epoch=1/micro_step=500/global_step=4180, RunningAvgSamplesPerSec=202.18223328446962, CurrSamplesPerSec=207.00153919708816, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:24,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=4190, skipped=35, lr=[1.9965851085323022e-05, 1.9965851085323022e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:24,837] [INFO] [timer.py:199:stop] epoch=1/micro_step=510/global_step=4190, RunningAvgSamplesPerSec=202.1926506166996, CurrSamplesPerSec=205.70870161617864, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:25,129] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:01:25,129] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:01:25,732] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4195 (x8, once per rank)
[2023-05-15 23:01:25,732] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:01:25,732] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:01:26,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=4200, skipped=36, lr=[1.987181521414118e-05, 1.987181521414118e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:26,354] [INFO] [timer.py:199:stop] epoch=1/micro_step=520/global_step=4200, RunningAvgSamplesPerSec=202.2141430140724, CurrSamplesPerSec=207.12868060099507, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:27,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=4210, skipped=36, lr=[1.9767419735845157e-05, 1.9767419735845157e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:27,907] [INFO] [timer.py:199:stop] epoch=1/micro_step=530/global_step=4210, RunningAvgSamplesPerSec=202.22481991286907, CurrSamplesPerSec=207.22685873534203, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:29,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=4220, skipped=36, lr=[1.9663119594082512e-05, 1.9663119594082512e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:29,463] [INFO] [timer.py:199:stop] epoch=1/micro_step=540/global_step=4220, RunningAvgSamplesPerSec=202.23433389683132, CurrSamplesPerSec=205.83678470813223, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:31,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=4230, skipped=36, lr=[1.955891668918034e-05, 1.955891668918034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:31,018] [INFO] [timer.py:199:stop] epoch=1/micro_step=550/global_step=4230, RunningAvgSamplesPerSec=202.24406800507313, CurrSamplesPerSec=206.44021945603572, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:32,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=4240, skipped=36, lr=[1.945481291969408e-05, 1.945481291969408e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:32,576] [INFO] [timer.py:199:stop] epoch=1/micro_step=560/global_step=4240, RunningAvgSamplesPerSec=202.25289786406415, CurrSamplesPerSec=206.1424929195874, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:34,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=4250, skipped=36, lr=[1.9350810182372934e-05, 1.9350810182372934e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
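A rhythm is visible by now: each doubling to 32768.0 is followed within a handful of iterations by an overflow (3845, 3955, 4090, 4195, ...), so in this stretch the scaler effectively oscillates around a stable value of 16384.0, costing roughly one skipped step per 100-140 iterations, i.e. under 1% of updates. A small parsing sketch to measure that cadence from a captured log (the file name is hypothetical):

    import re

    text = open("train.log").read()  # hypothetical saved copy of this output
    iters = sorted({int(n) for n in re.findall(r"Grad overflow on iteration (\d+)", text)})
    gaps = [b - a for a, b in zip(iters, iters[1:])]
    print(iters)  # [3845, 3955, 4090, 4195, 4332, 4437, 4540, ...]
    print(gaps)   # ~100-140 iterations between overflows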
[2023-05-15 23:01:34,131] [INFO] [timer.py:199:stop] epoch=1/micro_step=570/global_step=4250, RunningAvgSamplesPerSec=202.26274906444075, CurrSamplesPerSec=206.95621495932352, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:35,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=4260, skipped=36, lr=[1.9246910372125342e-05, 1.9246910372125342e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:35,688] [INFO] [timer.py:199:stop] epoch=1/micro_step=580/global_step=4260, RunningAvgSamplesPerSec=202.27181605834426, CurrSamplesPerSec=206.50660364585258, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:37,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=4270, skipped=36, lr=[1.914311538198441e-05, 1.914311538198441e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:37,244] [INFO] [timer.py:199:stop] epoch=1/micro_step=590/global_step=4270, RunningAvgSamplesPerSec=202.28114244535521, CurrSamplesPerSec=206.3875236614646, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:38,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=4280, skipped=36, lr=[1.9039427103073467e-05, 1.9039427103073467e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:38,801] [INFO] [timer.py:199:stop] epoch=1/micro_step=600/global_step=4280, RunningAvgSamplesPerSec=202.29023836272472, CurrSamplesPerSec=206.02288363240058, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:40,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=4290, skipped=36, lr=[1.8935847424571556e-05, 1.8935847424571556e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:40,392] [INFO] [timer.py:199:stop] epoch=1/micro_step=610/global_step=4290, RunningAvgSamplesPerSec=202.28924327223737, CurrSamplesPerSec=201.46005928927914, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:41,492] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:01:41,493] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:01:41,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=4300, skipped=36, lr=[1.8832378233679076e-05, 1.8832378233679076e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:41,990] [INFO] [timer.py:199:stop] epoch=1/micro_step=620/global_step=4300, RunningAvgSamplesPerSec=202.28595190590275, CurrSamplesPerSec=201.31169493457483, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:43,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=4310, skipped=36, lr=[1.8729021415583343e-05, 1.8729021415583343e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:43,585] [INFO] [timer.py:199:stop] epoch=1/micro_step=630/global_step=4310, RunningAvgSamplesPerSec=202.28374302990153, CurrSamplesPerSec=201.12647116119462, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:45,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=4320, skipped=36, lr=[1.8625778853424265e-05, 1.8625778853424265e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:45,187] [INFO] [timer.py:199:stop] epoch=1/micro_step=640/global_step=4320, RunningAvgSamplesPerSec=202.2793567678108, CurrSamplesPerSec=199.85188494102022, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:46,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=4330, skipped=36, lr=[1.8522652428260034e-05, 1.8522652428260034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:46,795] [INFO] [timer.py:199:stop] epoch=1/micro_step=650/global_step=4330, RunningAvgSamplesPerSec=202.27346831594915, CurrSamplesPerSec=200.24546672515092, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:47,240] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4332 (x8, once per rank)
[2023-05-15 23:01:47,240] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:01:47,240] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:01:48,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=4340, skipped=37, lr=[1.8429939495732837e-05, 1.8429939495732837e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:48,363] [INFO] [timer.py:199:stop] epoch=1/micro_step=660/global_step=4340, RunningAvgSamplesPerSec=202.2792594659985, CurrSamplesPerSec=198.50024624385685, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=4350, skipped=37, lr=[1.832703890556813e-05, 1.832703890556813e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:49,973] [INFO] [timer.py:199:stop] epoch=1/micro_step=670/global_step=4350, RunningAvgSamplesPerSec=202.27256556894162, CurrSamplesPerSec=199.30849384556672, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:51,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=4360, skipped=37, lr=[1.8224259895378527e-05, 1.8224259895378527e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:51,576] [INFO] [timer.py:199:stop] epoch=1/micro_step=680/global_step=4360, RunningAvgSamplesPerSec=202.26815434281588, CurrSamplesPerSec=200.57343873754053, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:53,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=4370, skipped=37, lr=[1.8121604337776397e-05, 1.8121604337776397e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:53,183] [INFO] [timer.py:199:stop] epoch=1/micro_step=690/global_step=4370, RunningAvgSamplesPerSec=202.2623996947172, CurrSamplesPerSec=200.92925439942513, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:54,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=4380, skipped=37, lr=[1.8019074103124815e-05, 1.8019074103124815e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:54,779] [INFO] [timer.py:199:stop] epoch=1/micro_step=700/global_step=4380, RunningAvgSamplesPerSec=202.25997751432166, CurrSamplesPerSec=200.4111140311059, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:56,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=4390, skipped=37, lr=[1.7916671059503527e-05, 1.7916671059503527e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:56,392] [INFO] [timer.py:199:stop] epoch=1/micro_step=710/global_step=4390, RunningAvgSamplesPerSec=202.25258594469992, CurrSamplesPerSec=200.25024692279, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:57,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=4400, skipped=37, lr=[1.7814397072674876e-05, 1.7814397072674876e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:01:57,987] [INFO] [timer.py:199:stop] epoch=1/micro_step=720/global_step=4400, RunningAvgSamplesPerSec=202.25035864850847, CurrSamplesPerSec=201.15902653089975, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:01:59,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=4410, skipped=37, lr=[1.77122540060498e-05, 1.77122540060498e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
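Every fused_optimizer scale message is emitted once per data-parallel rank, so an 8-GPU run prints eight near-identical copies of each event. A common way to keep such logs readable, sketched with plain torch.distributed (this is generic usage, not a DeepSpeed-specific switch):

    import torch.distributed as dist

    def log_rank0(msg: str) -> None:
        # Print only from the global rank-0 process when distributed is initialized
        if not dist.is_initialized() or dist.get_rank() == 0:
            print(msg)

    log_rank0("Increasing dynamic loss scale from 16384.0 to 32768.0")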
[2023-05-15 23:01:59,564] [INFO] [timer.py:199:stop] epoch=1/micro_step=730/global_step=4410, RunningAvgSamplesPerSec=202.25340077340616, CurrSamplesPerSec=205.5640901544437, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:01,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=4420, skipped=37, lr=[1.7610243720653918e-05, 1.7610243720653918e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:01,120] [INFO] [timer.py:199:stop] epoch=1/micro_step=740/global_step=4420, RunningAvgSamplesPerSec=202.26286027925343, CurrSamplesPerSec=206.01339677666923, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:02,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=4430, skipped=37, lr=[1.7508368075093583e-05, 1.7508368075093583e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:02,675] [INFO] [timer.py:199:stop] epoch=1/micro_step=750/global_step=4430, RunningAvgSamplesPerSec=202.27224523806967, CurrSamplesPerSec=207.20510441480008, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:03,278] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:02:03,278] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:02:03,882] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4437 (x8, once per rank)
[2023-05-15 23:02:03,882] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:02:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:02:04,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=4440, skipped=38, lr=[1.741679664531059e-05, 1.741679664531059e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:04,194] [INFO] [timer.py:199:stop] epoch=1/micro_step=760/global_step=4440, RunningAvgSamplesPerSec=202.29196696248212, CurrSamplesPerSec=207.70596928771835, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:05,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=4450, skipped=38, lr=[1.7315181927085277e-05, 1.7315181927085277e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:05,748] [INFO] [timer.py:199:stop] epoch=1/micro_step=770/global_step=4450, RunningAvgSamplesPerSec=202.30158437217804, CurrSamplesPerSec=206.59433433692286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:07,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=4460, skipped=38, lr=[1.7213707224660558e-05, 1.7213707224660558e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:07,302] [INFO] [timer.py:199:stop] epoch=1/micro_step=780/global_step=4460, RunningAvgSamplesPerSec=202.311268376378, CurrSamplesPerSec=206.8665825129891, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:08,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=4470, skipped=38, lr=[1.7112374386884583e-05, 1.7112374386884583e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:08,862] [INFO] [timer.py:199:stop] epoch=1/micro_step=790/global_step=4470, RunningAvgSamplesPerSec=202.31905694926138, CurrSamplesPerSec=205.49956899785494, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:10,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=4480, skipped=38, lr=[1.701118526002075e-05, 1.701118526002075e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:10,418] [INFO] [timer.py:199:stop] epoch=1/micro_step=800/global_step=4480, RunningAvgSamplesPerSec=202.32803419192405, CurrSamplesPerSec=205.92521084600762, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:11,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=4490, skipped=38, lr=[1.691014168771409e-05, 1.691014168771409e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
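MemAllocated holds flat at 4.34GB with a 12.81GB peak throughout, suggesting roughly 4.3GB of persistent model and optimizer state plus about 8.5GB of transient activation memory at peak. These figures are PyTorch CUDA allocator counters, which can be queried directly:

    import torch

    gib = 1024 ** 3
    # The same counters DeepSpeed reports as MemAllocated / MaxMemAllocated
    print(f"MemAllocated={torch.cuda.memory_allocated() / gib:.2f}GB, "
          f"MaxMemAllocated={torch.cuda.max_memory_allocated() / gib:.2f}GB")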
[2023-05-15 23:02:11,974] [INFO] [timer.py:199:stop] epoch=1/micro_step=810/global_step=4490, RunningAvgSamplesPerSec=202.33671823412377, CurrSamplesPerSec=206.76301187112756, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:13,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=4500, skipped=38, lr=[1.6809245510957665e-05, 1.6809245510957665e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:13,528] [INFO] [timer.py:199:stop] epoch=1/micro_step=820/global_step=4500, RunningAvgSamplesPerSec=202.34616875122, CurrSamplesPerSec=207.43471460629704, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:15,076] [INFO] [logging.py:96:log_dist] [Rank 0] step=4510, skipped=38, lr=[1.6708498568058996e-05, 1.6708498568058996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:15,082] [INFO] [timer.py:199:stop] epoch=1/micro_step=830/global_step=4510, RunningAvgSamplesPerSec=202.35549605855107, CurrSamplesPerSec=206.18556092884927, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:16,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=4520, skipped=38, lr=[1.660790269460661e-05, 1.660790269460661e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:16,639] [INFO] [timer.py:199:stop] epoch=1/micro_step=840/global_step=4520, RunningAvgSamplesPerSec=202.3639682875577, CurrSamplesPerSec=206.57938942459387, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:18,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=4530, skipped=38, lr=[1.6507459723436585e-05, 1.6507459723436585e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:18,196] [INFO] [timer.py:199:stop] epoch=1/micro_step=850/global_step=4530, RunningAvgSamplesPerSec=202.3723742836334, CurrSamplesPerSec=206.65795389457216, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:19,579] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:02:19,579] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:02:19,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=4540, skipped=38, lr=[1.6407171484599128e-05, 1.6407171484599128e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:19,752] [INFO] [timer.py:199:stop] epoch=1/micro_step=860/global_step=4540, RunningAvgSamplesPerSec=202.38106583163568, CurrSamplesPerSec=206.75122732699194, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:19,882] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4540 (x8, once per rank)
[2023-05-15 23:02:19,883] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:02:19,883] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:02:21,277] [INFO] [logging.py:96:log_dist] [Rank 0] step=4550, skipped=39, lr=[1.6317045876055006e-05, 1.6317045876055006e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:21,283] [INFO] [timer.py:199:stop] epoch=1/micro_step=870/global_step=4550, RunningAvgSamplesPerSec=202.39690937883728, CurrSamplesPerSec=206.12033257468954, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:22,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=4560, skipped=39, lr=[1.6217056660314052e-05, 1.6217056660314052e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:22,836] [INFO] [timer.py:199:stop] epoch=1/micro_step=880/global_step=4560, RunningAvgSamplesPerSec=202.406456958769, CurrSamplesPerSec=206.36816630277684, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:24,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=4570, skipped=39, lr=[1.6117227467989602e-05, 1.6117227467989602e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:24,389] [INFO] [timer.py:199:stop] epoch=1/micro_step=890/global_step=4570, RunningAvgSamplesPerSec=202.41565301904205, CurrSamplesPerSec=207.19422896723768, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:25,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=4580, skipped=39, lr=[1.6017560117948946e-05, 1.6017560117948946e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:25,944] [INFO] [timer.py:199:stop] epoch=1/micro_step=900/global_step=4580, RunningAvgSamplesPerSec=202.42445149204565, CurrSamplesPerSec=206.71492663513473, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:27,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=4590, skipped=39, lr=[1.5918056426110657e-05, 1.5918056426110657e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:27,529] [INFO] [timer.py:199:stop] epoch=1/micro_step=910/global_step=4590, RunningAvgSamplesPerSec=202.42506425169822, CurrSamplesPerSec=201.63499272136733, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:29,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=4600, skipped=39, lr=[1.5818718205411487e-05, 1.5818718205411487e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:29,118] [INFO] [timer.py:199:stop] epoch=1/micro_step=920/global_step=4600, RunningAvgSamplesPerSec=202.42437621867816, CurrSamplesPerSec=200.57613638571286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:30,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=4610, skipped=39, lr=[1.5719547265773317e-05, 1.5719547265773317e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:30,711] [INFO] [timer.py:199:stop] epoch=1/micro_step=930/global_step=4610, RunningAvgSamplesPerSec=202.4224314650033, CurrSamplesPerSec=202.07426678711232, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:32,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=4620, skipped=39, lr=[1.562054541407023e-05, 1.562054541407023e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
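Reading these counters: global_step counts attempted optimizer steps, skipped counts those discarded after an overflow, so the number of updates actually applied is their difference (which the lr values are consistent with, per the cosine check above). At step=4550, for example:

    global_step, skipped = 4550, 39
    applied = global_step - skipped    # 4511 updates actually applied
    skip_rate = skipped / global_step  # ~0.86% of attempted steps lost to overflow
    print(applied, f"{skip_rate:.2%}")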
[2023-05-15 23:02:32,296] [INFO] [timer.py:199:stop] epoch=1/micro_step=940/global_step=4620, RunningAvgSamplesPerSec=202.42274235927184, CurrSamplesPerSec=202.8772845242612, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:33,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=4630, skipped=39, lr=[1.552171445409555e-05, 1.552171445409555e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:33,880] [INFO] [timer.py:199:stop] epoch=1/micro_step=950/global_step=4630, RunningAvgSamplesPerSec=202.42342976190744, CurrSamplesPerSec=202.6261347149946, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:35,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=4640, skipped=39, lr=[1.5423056186528972e-05, 1.5423056186528972e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:35,465] [INFO] [timer.py:199:stop] epoch=1/micro_step=960/global_step=4640, RunningAvgSamplesPerSec=202.42390429263756, CurrSamplesPerSec=203.32322608010665, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:35,764] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (x8, once per rank)
[2023-05-15 23:02:35,764] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (x8, once per rank)
[2023-05-15 23:02:37,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=4650, skipped=39, lr=[1.53245724089038e-05, 1.53245724089038e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:37,052] [INFO] [timer.py:199:stop] epoch=1/micro_step=970/global_step=4650, RunningAvgSamplesPerSec=202.42395897824042, CurrSamplesPerSec=202.86379238681107, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:38,448] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4658 (x8, once per rank)
[2023-05-15 23:02:38,449] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (x8, once per rank)
[2023-05-15 23:02:38,449] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:02:38,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=4660, skipped=40, lr=[1.5236087681040948e-05, 1.5236087681040948e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:38,605] [INFO] [timer.py:199:stop] epoch=1/micro_step=980/global_step=4660, RunningAvgSamplesPerSec=202.43304285945496, CurrSamplesPerSec=206.67545614270117, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:40,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=4670, skipped=40, lr=[1.5137940375096114e-05, 1.5137940375096114e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:40,211] [INFO] [timer.py:199:stop] epoch=1/micro_step=990/global_step=4670, RunningAvgSamplesPerSec=202.42777666366766, CurrSamplesPerSec=202.52279287612282, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:41,789] [INFO] [logging.py:96:log_dist] [Rank 0] step=4680, skipped=40, lr=[1.5039972753843965e-05, 1.5039972753843965e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:41,796] [INFO] [timer.py:199:stop] epoch=1/micro_step=1000/global_step=4680, RunningAvgSamplesPerSec=202.428219721254, CurrSamplesPerSec=202.9895841682988, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:43,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=4690, skipped=40, lr=[1.4942186602234377e-05, 1.4942186602234377e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:43,368] [INFO] [timer.py:199:stop] epoch=1/micro_step=1010/global_step=4690, RunningAvgSamplesPerSec=202.431995336164, CurrSamplesPerSec=207.21918021827705, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:44,924] [INFO] [logging.py:96:log_dist] [Rank 0] step=4700, skipped=40, lr=[1.4844583701910847e-05, 1.4844583701910847e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:44,930] [INFO] [timer.py:199:stop] epoch=1/micro_step=1020/global_step=4700, RunningAvgSamplesPerSec=202.43882425243308, CurrSamplesPerSec=205.18729352124907, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:46,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=4710, skipped=40, lr=[1.47471658311781e-05, 1.47471658311781e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:46,488] [INFO] [timer.py:199:stop] epoch=1/micro_step=1030/global_step=4710, RunningAvgSamplesPerSec=202.4465740924854, CurrSamplesPerSec=205.91510115662362, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=4720, skipped=40, lr=[1.4649934764969664e-05, 1.4649934764969664e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:48,089] [INFO] [timer.py:199:stop] epoch=1/micro_step=1040/global_step=4720, RunningAvgSamplesPerSec=202.44270702324948, CurrSamplesPerSec=198.91858309237526, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:49,695] [INFO] [logging.py:96:log_dist] [Rank 0] step=4730, skipped=40, lr=[1.4552892274815505e-05, 1.4552892274815505e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:49,701] [INFO] [timer.py:199:stop] epoch=1/micro_step=1050/global_step=4730, RunningAvgSamplesPerSec=202.43590769852406, CurrSamplesPerSec=205.40113430113155, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:51,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=4740, skipped=40, lr=[1.4456040128809772e-05, 1.4456040128809772e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:51,258] [INFO] [timer.py:199:stop] epoch=1/micro_step=1060/global_step=4740, RunningAvgSamplesPerSec=202.44395487501623, CurrSamplesPerSec=206.87551037708948, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:52,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=4750, skipped=40, lr=[1.4359380091578606e-05, 1.4359380091578606e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:52,811] [INFO] [timer.py:199:stop] epoch=1/micro_step=1070/global_step=4750, RunningAvgSamplesPerSec=202.45305941346402, CurrSamplesPerSec=206.72384141354556, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:54,338] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4759
[2023-05-15 23:02:54,338] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:02:54,338] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:02:54,339] [INFO] [logging.py:96:log_dist] [Rank 0] step=4760, skipped=41, lr=[1.4272551766716252e-05, 1.4272551766716252e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:54,339] [INFO] [timer.py:199:stop] epoch=1/micro_step=1080/global_step=4760, RunningAvgSamplesPerSec=202.46872484281715, CurrSamplesPerSec=246.40941152279996, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:55,914] [INFO] [logging.py:96:log_dist] [Rank 0] step=4770, skipped=41, lr=[1.417626158513998e-05, 1.417626158513998e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:55,920] [INFO] [timer.py:199:stop] epoch=1/micro_step=1090/global_step=4770, RunningAvgSamplesPerSec=202.47022903717567, CurrSamplesPerSec=206.5695333253302, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:57,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=4780, skipped=41, lr=[1.4080168609845643e-05, 1.4080168609845643e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:57,475] [INFO] [timer.py:199:stop] epoch=1/micro_step=1100/global_step=4780, RunningAvgSamplesPerSec=202.47882917903652, CurrSamplesPerSec=206.88220678089414, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:02:59,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=4790, skipped=41, lr=[1.3984274591627445e-05, 1.3984274591627445e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:02:59,035] [INFO] [timer.py:199:stop] epoch=1/micro_step=1110/global_step=4790, RunningAvgSamplesPerSec=202.48566275246174, CurrSamplesPerSec=201.01953466080562, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:00,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=4800, skipped=41, lr=[1.3888581277654606e-05, 1.3888581277654606e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:00,592] [INFO] [timer.py:199:stop] epoch=1/micro_step=1120/global_step=4800, RunningAvgSamplesPerSec=202.49352468473882, CurrSamplesPerSec=206.89304949670895, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:02,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=4810, skipped=41, lr=[1.3793090411439586e-05, 1.3793090411439586e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:02,147] [INFO] [timer.py:199:stop] epoch=1/micro_step=1130/global_step=4810, RunningAvgSamplesPerSec=202.50192922299732, CurrSamplesPerSec=206.8809312419848, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:03,693] [INFO] [logging.py:96:log_dist] [Rank 0] step=4820, skipped=41, lr=[1.3697803732806278e-05, 1.3697803732806278e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:03,699] [INFO] [timer.py:199:stop] epoch=1/micro_step=1140/global_step=4820, RunningAvgSamplesPerSec=202.51083532939597, CurrSamplesPerSec=207.35235511584398, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:05,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=4830, skipped=41, lr=[1.36027229778583e-05, 1.36027229778583e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:05,253] [INFO] [timer.py:199:stop] epoch=1/micro_step=1150/global_step=4830, RunningAvgSamplesPerSec=202.51934100547183, CurrSamplesPerSec=206.9016606983473, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
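Between steps 4710 and 4830 the learning rate falls smoothly from about 1.475e-05 to 1.360e-05, the signature of a cosine decay schedule rather than a linear one. A hedged sketch of how such per-step values are produced (peak_lr and total_steps below are illustrative assumptions, not values read from this run):

    import math

    # Illustrative cosine learning-rate decay; peak_lr and total_steps
    # are assumptions for the sketch.
    def cosine_lr(step, total_steps, peak_lr=5e-5, min_lr=0.0):
        progress = min(step / float(total_steps), 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
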
[2023-05-15 23:03:06,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=4840, skipped=41, lr=[1.3507849878947418e-05, 1.3507849878947418e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:06,812] [INFO] [timer.py:199:stop] epoch=1/micro_step=1160/global_step=4840, RunningAvgSamplesPerSec=202.52627632391207, CurrSamplesPerSec=206.82673537379534, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:08,359] [INFO] [logging.py:96:log_dist] [Rank 0] step=4850, skipped=41, lr=[1.3413186164641933e-05, 1.3413186164641933e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:08,366] [INFO] [timer.py:199:stop] epoch=1/micro_step=1170/global_step=4850, RunningAvgSamplesPerSec=202.53484904799512, CurrSamplesPerSec=206.90453125818954, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:09,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=4860, skipped=41, lr=[1.331873355969519e-05, 1.331873355969519e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:09,929] [INFO] [timer.py:199:stop] epoch=1/micro_step=1180/global_step=4860, RunningAvgSamplesPerSec=202.5409249691156, CurrSamplesPerSec=206.14344275513793, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:10,066] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:03:10,066] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:03:11,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=4870, skipped=41, lr=[1.3224493785014163e-05, 1.3224493785014163e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:11,482] [INFO] [timer.py:199:stop] epoch=1/micro_step=1190/global_step=4870, RunningAvgSamplesPerSec=202.54942795842365, CurrSamplesPerSec=206.47197690034506, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:13,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=4880, skipped=41, lr=[1.3130468557628128e-05, 1.3130468557628128e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:13,034] [INFO] [timer.py:199:stop] epoch=1/micro_step=1200/global_step=4880, RunningAvgSamplesPerSec=202.5582964301111, CurrSamplesPerSec=206.786903582549, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:14,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=4890, skipped=41, lr=[1.3036659590657344e-05, 1.3036659590657344e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:14,587] [INFO] [timer.py:199:stop] epoch=1/micro_step=1210/global_step=4890, RunningAvgSamplesPerSec=202.56673728926043, CurrSamplesPerSec=207.10534900048606, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:16,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=4900, skipped=41, lr=[1.2943068593281826e-05, 1.2943068593281826e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:16,140] [INFO] [timer.py:199:stop] epoch=1/micro_step=1220/global_step=4900, RunningAvgSamplesPerSec=202.5750846424062, CurrSamplesPerSec=205.72225951914558, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:17,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=4910, skipped=41, lr=[1.284969727071026e-05, 1.284969727071026e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:17,694] [INFO] [timer.py:199:stop] epoch=1/micro_step=1230/global_step=4910, RunningAvgSamplesPerSec=202.58339598510858, CurrSamplesPerSec=206.5911543872503, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:19,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=4920, skipped=41, lr=[1.275654732414889e-05, 1.275654732414889e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:19,273] [INFO] [timer.py:199:stop] epoch=1/micro_step=1240/global_step=4920, RunningAvgSamplesPerSec=202.58510015564602, CurrSamplesPerSec=206.48563560560606, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:20,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=4930, skipped=41, lr=[1.2663620450770524e-05, 1.2663620450770524e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:20,837] [INFO] [timer.py:199:stop] epoch=1/micro_step=1250/global_step=4930, RunningAvgSamplesPerSec=202.59059540316886, CurrSamplesPerSec=206.99866594488887, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:22,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=4940, skipped=41, lr=[1.2570918343683635e-05, 1.2570918343683635e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:22,391] [INFO] [timer.py:199:stop] epoch=1/micro_step=1260/global_step=4940, RunningAvgSamplesPerSec=202.5988456691676, CurrSamplesPerSec=207.2831852530235, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:23,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=4950, skipped=41, lr=[1.2478442691901515e-05, 1.2478442691901515e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:23,948] [INFO] [timer.py:199:stop] epoch=1/micro_step=1270/global_step=4950, RunningAvgSamplesPerSec=202.60610171939857, CurrSamplesPerSec=206.1310955756779, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:25,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=4960, skipped=41, lr=[1.2386195180311452e-05, 1.2386195180311452e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:25,513] [INFO] [timer.py:199:stop] epoch=1/micro_step=1280/global_step=4960, RunningAvgSamplesPerSec=202.61142809713755, CurrSamplesPerSec=206.7018742396008, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
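MemAllocated and MaxMemAllocated are flat at 4.34GB and 12.81GB across every timer record: steady-state tensor allocation versus the peak the allocator ever reached earlier in the run. These figures correspond to PyTorch's CUDA allocator counters; a minimal way to sample comparable numbers on the current device (a sketch, not the logger's exact code):

    import torch

    # Sample the caching-allocator counters behind figures like
    # MemAllocated / MaxMemAllocated (sketch; the logger's code may differ).
    def memory_gb():
        alloc_gb = torch.cuda.memory_allocated() / 2**30      # current, ~MemAllocated
        peak_gb = torch.cuda.max_memory_allocated() / 2**30   # peak, ~MaxMemAllocated
        return alloc_gb, peak_gb
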
[2023-05-15 23:03:25,649] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:03:25,650] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 23:03:26,109] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4963
[2023-05-15 23:03:26,109] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 23:03:26,109] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:03:27,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=4970, skipped=42, lr=[1.2303368868954848e-05, 1.2303368868954848e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:27,078] [INFO] [timer.py:199:stop] epoch=1/micro_step=1290/global_step=4970, RunningAvgSamplesPerSec=202.61665259412595, CurrSamplesPerSec=196.48155337531804, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:28,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=4980, skipped=42, lr=[1.221155945068244e-05, 1.221155945068244e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:28,704] [INFO] [timer.py:199:stop] epoch=1/micro_step=1300/global_step=4980, RunningAvgSamplesPerSec=202.6059815979832, CurrSamplesPerSec=198.65360837254323, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:30,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=4990, skipped=42, lr=[1.211998303515972e-05, 1.211998303515972e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:30,282] [INFO] [timer.py:199:stop] epoch=1/micro_step=1310/global_step=4990, RunningAvgSamplesPerSec=202.6078783328128, CurrSamplesPerSec=207.0727574837966, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:31,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=5000, skipped=42, lr=[1.2028641290890088e-05, 1.2028641290890088e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:31,835] [INFO] [timer.py:199:stop] epoch=1/micro_step=1320/global_step=5000, RunningAvgSamplesPerSec=202.61613668587577, CurrSamplesPerSec=206.52439805012557, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:33,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=5010, skipped=42, lr=[1.1937535882101281e-05, 1.1937535882101281e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:33,393] [INFO] [timer.py:199:stop] epoch=1/micro_step=1330/global_step=5010, RunningAvgSamplesPerSec=202.6230573038349, CurrSamplesPerSec=206.69805433808378, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:34,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=5020, skipped=42, lr=[1.1846668468715077e-05, 1.1846668468715077e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:34,952] [INFO] [timer.py:199:stop] epoch=1/micro_step=1340/global_step=5020, RunningAvgSamplesPerSec=202.62964014131717, CurrSamplesPerSec=197.74950127149984, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:36,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=5030, skipped=42, lr=[1.175604070631699e-05, 1.175604070631699e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:36,505] [INFO] [timer.py:199:stop] epoch=1/micro_step=1350/global_step=5030, RunningAvgSamplesPerSec=202.6379017003208, CurrSamplesPerSec=207.10694688763385, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:38,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=5040, skipped=42, lr=[1.1665654246126175e-05, 1.1665654246126175e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:38,057] [INFO] [timer.py:199:stop] epoch=1/micro_step=1360/global_step=5040, RunningAvgSamplesPerSec=202.6462171875403, CurrSamplesPerSec=206.33517196426047, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:39,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=5050, skipped=42, lr=[1.1575510734965306e-05, 1.1575510734965306e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:39,628] [INFO] [timer.py:199:stop] epoch=1/micro_step=1370/global_step=5050, RunningAvgSamplesPerSec=202.64973596396766, CurrSamplesPerSec=199.47822675092593, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:41,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=5060, skipped=42, lr=[1.1485611815230545e-05, 1.1485611815230545e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:41,217] [INFO] [timer.py:199:stop] epoch=1/micro_step=1380/global_step=5060, RunningAvgSamplesPerSec=202.64891813894226, CurrSamplesPerSec=201.62075668400195, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:41,990] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:03:41,990] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-05-15 23:03:42,141] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5065
[2023-05-15 23:03:42,141] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-05-15 23:03:42,141] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:03:42,580] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5068
[2023-05-15 23:03:42,580] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:03:42,580] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:03:42,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=5070, skipped=44, lr=[1.1413869886115575e-05, 1.1413869886115575e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:42,736] [INFO] [timer.py:199:stop] epoch=1/micro_step=1390/global_step=5070, RunningAvgSamplesPerSec=202.66540425929807, CurrSamplesPerSec=206.58511275256004, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:44,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=5080, skipped=44, lr=[1.1324415355542328e-05, 1.1324415355542328e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:44,323] [INFO] [timer.py:199:stop] epoch=1/micro_step=1400/global_step=5080, RunningAvgSamplesPerSec=202.66480798161027, CurrSamplesPerSec=201.5178324229694, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:45,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=5090, skipped=44, lr=[1.1235209991301229e-05, 1.1235209991301229e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:45,918] [INFO] [timer.py:199:stop] epoch=1/micro_step=1410/global_step=5090, RunningAvgSamplesPerSec=202.66233250285694, CurrSamplesPerSec=201.7962728250002, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:47,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=5100, skipped=44, lr=[1.1146255418695634e-05, 1.1146255418695634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:47,516] [INFO] [timer.py:199:stop] epoch=1/micro_step=1420/global_step=5100, RunningAvgSamplesPerSec=202.65919902967065, CurrSamplesPerSec=201.87640238188706, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:49,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=5110, skipped=44, lr=[1.1057553258459497e-05, 1.1057553258459497e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:49,109] [INFO] [timer.py:199:stop] epoch=1/micro_step=1430/global_step=5110, RunningAvgSamplesPerSec=202.6570003333625, CurrSamplesPerSec=201.36243451334337, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:50,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=5120, skipped=44, lr=[1.0969105126727903e-05, 1.0969105126727903e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:50,705] [INFO] [timer.py:199:stop] epoch=1/micro_step=1440/global_step=5120, RunningAvgSamplesPerSec=202.65430376590572, CurrSamplesPerSec=201.20516468311374, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:52,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=5130, skipped=44, lr=[1.0880912635007553e-05, 1.0880912635007553e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:52,287] [INFO] [timer.py:199:stop] epoch=1/micro_step=1450/global_step=5130, RunningAvgSamplesPerSec=202.6549168900984, CurrSamplesPerSec=203.56406643623387, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:53,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=5140, skipped=44, lr=[1.0792977390147474e-05, 1.0792977390147474e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:53,870] [INFO] [timer.py:199:stop] epoch=1/micro_step=1460/global_step=5140, RunningAvgSamplesPerSec=202.65538904865573, CurrSamplesPerSec=201.6837739636897, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:55,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=5150, skipped=44, lr=[1.0705300994309697e-05, 1.0705300994309697e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:55,521] [INFO] [timer.py:199:stop] epoch=1/micro_step=1470/global_step=5150, RunningAvgSamplesPerSec=202.63895710827026, CurrSamplesPerSec=200.35935720310204, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
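Iterations 5065 and 5068 overflow back to back, so the scale is halved twice within four iterations (32768 to 16384 to 8192) and the skipped counter jumps from 42 to 44. Replaying those events through the DynamicLossScaler sketch from earlier reproduces the same trajectory:

    # Replay of iterations 5065-5068 with the sketch class defined above.
    scaler = DynamicLossScaler(init_scale=32768.0)
    for iteration, overflow in [(5065, True), (5066, False), (5067, False), (5068, True)]:
        applied = scaler.update(overflow)
        print(iteration, scaler.cur_scale, "applied" if applied else "skipped")
    # 5065 -> 16384.0 skipped; 5068 -> 8192.0 skipped, matching the log.
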
[2023-05-15 23:03:57,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=5160, skipped=44, lr=[1.0617885044940063e-05, 1.0617885044940063e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:57,106] [INFO] [timer.py:199:stop] epoch=1/micro_step=1480/global_step=5160, RunningAvgSamplesPerSec=202.63892097787502, CurrSamplesPerSec=202.87299137676942, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:03:58,674] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:03:58,674] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:03:58,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=5170, skipped=44, lr=[1.0530731134739144e-05, 1.0530731134739144e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:03:58,693] [INFO] [timer.py:199:stop] epoch=1/micro_step=1490/global_step=5170, RunningAvgSamplesPerSec=202.63834675290553, CurrSamplesPerSec=201.03609484295194, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:00,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=5180, skipped=44, lr=[1.0443840851633227e-05, 1.0443840851633227e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:00,282] [INFO] [timer.py:199:stop] epoch=1/micro_step=1500/global_step=5180, RunningAvgSamplesPerSec=202.6373856826185, CurrSamplesPerSec=202.32008477629398, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:01,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=5190, skipped=44, lr=[1.0357215778745333e-05, 1.0357215778745333e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:01,865] [INFO] [timer.py:199:stop] epoch=1/micro_step=1510/global_step=5190, RunningAvgSamplesPerSec=202.63777990931618, CurrSamplesPerSec=203.69505626674155, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:03,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=5200, skipped=44, lr=[1.0270857494366442e-05, 1.0270857494366442e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:03,448] [INFO] [timer.py:199:stop] epoch=1/micro_step=1520/global_step=5200, RunningAvgSamplesPerSec=202.63827014639992, CurrSamplesPerSec=200.82042412100301, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:05,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=5210, skipped=44, lr=[1.0184767571926676e-05, 1.0184767571926676e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:05,032] [INFO] [timer.py:199:stop] epoch=1/micro_step=1530/global_step=5210, RunningAvgSamplesPerSec=202.63854156152252, CurrSamplesPerSec=201.85970347688246, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:06,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=5220, skipped=44, lr=[1.009894757996668e-05, 1.009894757996668e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:06,616] [INFO] [timer.py:199:stop] epoch=1/micro_step=1540/global_step=5220, RunningAvgSamplesPerSec=202.6386614300127, CurrSamplesPerSec=203.04762365812627, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:08,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=5230, skipped=44, lr=[1.0013399082108996e-05, 1.0013399082108996e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:08,189] [INFO] [timer.py:199:stop] epoch=1/micro_step=1550/global_step=5230, RunningAvgSamplesPerSec=202.64156868285528, CurrSamplesPerSec=207.51232695417707, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:09,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=5240, skipped=44, lr=[9.92812363702963e-06, 9.92812363702963e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:09,749] [INFO] [timer.py:199:stop] epoch=1/micro_step=1560/global_step=5240, RunningAvgSamplesPerSec=202.64790672365467, CurrSamplesPerSec=206.51168744845583, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:11,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=5250, skipped=44, lr=[9.843122798429591e-06, 9.843122798429591e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:11,353] [INFO] [timer.py:199:stop] epoch=1/micro_step=1570/global_step=5250, RunningAvgSamplesPerSec=202.6431718891647, CurrSamplesPerSec=196.64362245309087, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:12,790] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5258
[2023-05-15 23:04:12,790] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:04:12,790] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:04:12,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=5260, skipped=45, lr=[9.766858112331917e-06, 9.766858112331917e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:12,950] [INFO] [timer.py:199:stop] epoch=1/micro_step=1580/global_step=5260, RunningAvgSamplesPerSec=202.64041556636099, CurrSamplesPerSec=202.0794389498075, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:14,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=5270, skipped=45, lr=[9.682383288546982e-06, 9.682383288546982e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:14,509] [INFO] [timer.py:199:stop] epoch=1/micro_step=1590/global_step=5270, RunningAvgSamplesPerSec=202.64670400631306, CurrSamplesPerSec=206.5037441168823, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:16,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=5280, skipped=45, lr=[9.598187548579845e-06, 9.598187548579845e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:16,062] [INFO] [timer.py:199:stop] epoch=1/micro_step=1600/global_step=5280, RunningAvgSamplesPerSec=202.65442570653306, CurrSamplesPerSec=206.90038491953996, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:17,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=5290, skipped=45, lr=[9.514272426459542e-06, 9.514272426459542e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:17,617] [INFO] [timer.py:199:stop] epoch=1/micro_step=1610/global_step=5290, RunningAvgSamplesPerSec=202.6618065860562, CurrSamplesPerSec=206.36245498909898, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:19,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=5300, skipped=45, lr=[9.430639451102286e-06, 9.430639451102286e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:19,170] [INFO] [timer.py:199:stop] epoch=1/micro_step=1620/global_step=5300, RunningAvgSamplesPerSec=202.66950712324402, CurrSamplesPerSec=205.9662916692882, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:20,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=5310, skipped=45, lr=[9.347290146283654e-06, 9.347290146283654e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:20,729] [INFO] [timer.py:199:stop] epoch=1/micro_step=1630/global_step=5310, RunningAvgSamplesPerSec=202.6755799458198, CurrSamplesPerSec=206.94185578582034, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:22,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=5320, skipped=45, lr=[9.264226030610776e-06, 9.264226030610776e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:22,282] [INFO] [timer.py:199:stop] epoch=1/micro_step=1640/global_step=5320, RunningAvgSamplesPerSec=202.68331512214004, CurrSamplesPerSec=206.90134175217048, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
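Records this regular are easy to mine for plots of throughput and loss-scale behaviour. An illustrative parser (the regexes are written against the record formats above; this is not a DeepSpeed utility):

    import re

    # Matches "...global_step=5280, RunningAvgSamplesPerSec=..., CurrSamplesPerSec=..."
    TIMER_RE = re.compile(r"global_step=(\d+), RunningAvgSamplesPerSec=([\d.]+), "
                          r"CurrSamplesPerSec=([\d.]+)")
    # Matches "Reducing/Increasing dynamic loss scale from X to Y"
    SCALE_RE = re.compile(r"(Reducing|Increasing) dynamic loss scale from ([\d.]+) to ([\d.]+)")

    def parse(lines):
        """Yield throughput and loss-scale events from raw log lines."""
        for line in lines:
            m = TIMER_RE.search(line)
            if m:
                yield ("throughput", int(m.group(1)), float(m.group(2)), float(m.group(3)))
                continue
            m = SCALE_RE.search(line)
            if m:
                yield ("scale", m.group(1).lower(), float(m.group(2)), float(m.group(3)))
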
[2023-05-15 23:04:23,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=5330, skipped=45, lr=[9.181448617494718e-06, 9.181448617494718e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:23,834] [INFO] [timer.py:199:stop] epoch=1/micro_step=1650/global_step=5330, RunningAvgSamplesPerSec=202.691114919737, CurrSamplesPerSec=206.30631057141758, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:25,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=5340, skipped=45, lr=[9.098959415122885e-06, 9.098959415122885e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:25,390] [INFO] [timer.py:199:stop] epoch=1/micro_step=1660/global_step=5340, RunningAvgSamplesPerSec=202.69789025668672, CurrSamplesPerSec=206.61818708021792, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:26,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=5350, skipped=45, lr=[9.016759926431518e-06, 9.016759926431518e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:26,943] [INFO] [timer.py:199:stop] epoch=1/micro_step=1670/global_step=5350, RunningAvgSamplesPerSec=202.70548064344104, CurrSamplesPerSec=206.97568279845976, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:28,478] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:04:28,478] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:04:28,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=5360, skipped=45, lr=[8.934851649078368e-06, 8.934851649078368e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:28,497] [INFO] [timer.py:199:stop] epoch=1/micro_step=1680/global_step=5360, RunningAvgSamplesPerSec=202.71275601324143, CurrSamplesPerSec=206.8480915312905, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:30,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=5370, skipped=45, lr=[8.853236075415338e-06, 8.853236075415338e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:30,053] [INFO] [timer.py:199:stop] epoch=1/micro_step=1690/global_step=5370, RunningAvgSamplesPerSec=202.71951734303335, CurrSamplesPerSec=206.7375334633359, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:30,948] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5375
[2023-05-15 23:04:30,949] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:04:30,949] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:04:31,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=5380, skipped=46, lr=[8.780033549905447e-06, 8.780033549905447e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:31,572] [INFO] [timer.py:199:stop] epoch=1/micro_step=1700/global_step=5380, RunningAvgSamplesPerSec=202.73515914992154, CurrSamplesPerSec=206.63027457259201, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:33,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=5390, skipped=46, lr=[8.69897820556124e-06, 8.69897820556124e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:33,126] [INFO] [timer.py:199:stop] epoch=1/micro_step=1710/global_step=5390, RunningAvgSamplesPerSec=202.7423704766288, CurrSamplesPerSec=206.34627465028004, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:34,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=5400, skipped=46, lr=[8.61821986247267e-06, 8.61821986247267e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:34,679] [INFO] [timer.py:199:stop] epoch=1/micro_step=1720/global_step=5400, RunningAvgSamplesPerSec=202.74970220713263, CurrSamplesPerSec=206.12634705433368, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:36,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=5410, skipped=46, lr=[8.537759992040081e-06, 8.537759992040081e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:36,230] [INFO] [timer.py:199:stop] epoch=1/micro_step=1730/global_step=5410, RunningAvgSamplesPerSec=202.75756681879207, CurrSamplesPerSec=206.50310867675455, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:37,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=5420, skipped=46, lr=[8.45760006022574e-06, 8.45760006022574e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:37,781] [INFO] [timer.py:199:stop] epoch=1/micro_step=1740/global_step=5420, RunningAvgSamplesPerSec=202.7652741562726, CurrSamplesPerSec=206.76269335227124, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:39,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=5430, skipped=46, lr=[8.377741527527053e-06, 8.377741527527053e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:39,342] [INFO] [timer.py:199:stop] epoch=1/micro_step=1750/global_step=5430, RunningAvgSamplesPerSec=202.7708466498736, CurrSamplesPerSec=205.94448519144782, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:40,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=5440, skipped=46, lr=[8.298185848950039e-06, 8.298185848950039e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:40,898] [INFO] [timer.py:199:stop] epoch=1/micro_step=1760/global_step=5440, RunningAvgSamplesPerSec=202.7773411204748, CurrSamplesPerSec=206.887946900636, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:42,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=5450, skipped=46, lr=[8.218934473982737e-06, 8.218934473982737e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:42,451] [INFO] [timer.py:199:stop] epoch=1/micro_step=1770/global_step=5450, RunningAvgSamplesPerSec=202.78457630988066, CurrSamplesPerSec=206.2124875743427, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:43,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=5460, skipped=46, lr=[8.139988846568863e-06, 8.139988846568863e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:44,004] [INFO] [timer.py:199:stop] epoch=1/micro_step=1780/global_step=5460, RunningAvgSamplesPerSec=202.79190726404155, CurrSamplesPerSec=207.12772166451902, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:45,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=5470, skipped=46, lr=[8.061350405081483e-06, 8.061350405081483e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:45,561] [INFO] [timer.py:199:stop] epoch=1/micro_step=1790/global_step=5470, RunningAvgSamplesPerSec=202.7981262842066, CurrSamplesPerSec=206.08994899102967, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:46,642] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:04:46,642] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:04:47,145] [INFO] [logging.py:96:log_dist] [Rank 0] step=5480, skipped=46, lr=[7.983020582296763e-06, 7.983020582296763e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:47,148] [INFO] [timer.py:199:stop] epoch=1/micro_step=1800/global_step=5480, RunningAvgSamplesPerSec=202.79753359929285, CurrSamplesPerSec=189.52621597768913, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:48,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=5490, skipped=46, lr=[7.905000805367932e-06, 7.905000805367932e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:48,707] [INFO] [timer.py:199:stop] epoch=1/micro_step=1810/global_step=5490, RunningAvgSamplesPerSec=202.8035552685137, CurrSamplesPerSec=205.62487150143014, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:50,264] [INFO] [logging.py:96:log_dist] [Rank 0] step=5500, skipped=46, lr=[7.827292495799247e-06, 7.827292495799247e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:50,271] [INFO] [timer.py:199:stop] epoch=1/micro_step=1820/global_step=5500, RunningAvgSamplesPerSec=202.80821531090766, CurrSamplesPerSec=205.93152990599285, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:51,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=5510, skipped=46, lr=[7.749897069420061e-06, 7.749897069420061e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:51,825] [INFO] [timer.py:199:stop] epoch=1/micro_step=1830/global_step=5510, RunningAvgSamplesPerSec=202.81503018665927, CurrSamplesPerSec=206.86116240001232, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:53,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=5520, skipped=46, lr=[7.672815936359107e-06, 7.672815936359107e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:53,378] [INFO] [timer.py:199:stop] epoch=1/micro_step=1840/global_step=5520, RunningAvgSamplesPerSec=202.8222088490886, CurrSamplesPerSec=206.75059036287485, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:54,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=5530, skipped=46, lr=[7.596050501018723e-06, 7.596050501018723e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:54,957] [INFO] [timer.py:199:stop] epoch=1/micro_step=1850/global_step=5530, RunningAvgSamplesPerSec=202.82312481127948, CurrSamplesPerSec=197.04316906918942, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:04:56,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=5540, skipped=46, lr=[7.519602162049302e-06, 7.519602162049302e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:04:56,670] [INFO] [timer.py:199:stop] epoch=1/micro_step=1860/global_step=5540, RunningAvgSamplesPerSec=202.79324835901411, CurrSamplesPerSec=195.05952453617715, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
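One reading of the counters, consistent with the records above: global_step keeps advancing through an overflow (the 4650 to 4660 stretch straddles the iteration-4658 skip), while skipped counts the abandoned updates, so the number of parameter updates actually applied is just the difference. This is an inference from the log, not documented behaviour:

    # Inferred accounting: attempted steps minus overflow-skipped steps.
    global_step, skipped = 5540, 46
    applied_updates = global_step - skipped   # 5494 updates actually applied
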
RunningAvgSamplesPerSec=202.79324835901411, CurrSamplesPerSec=195.05952453617715, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:04:58,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=5550, skipped=46, lr=[7.4434723123238236e-06, 7.4434723123238236e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:04:58,244] [INFO] [timer.py:199:stop] epoch=1/micro_step=1870/global_step=5550, RunningAvgSamplesPerSec=202.7956237783477, CurrSamplesPerSec=207.36837323849002, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:04:59,812] [INFO] [logging.py:96:log_dist] [Rank 0] step=5560, skipped=46, lr=[7.367662338912451e-06, 7.367662338912451e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:04:59,819] [INFO] [timer.py:199:stop] epoch=1/micro_step=1880/global_step=5560, RunningAvgSamplesPerSec=202.7979418037593, CurrSamplesPerSec=206.5933803417422, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:05:01,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=5570, skipped=46, lr=[7.292173623057277e-06, 7.292173623057277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:05:01,372] [INFO] [timer.py:199:stop] epoch=1/micro_step=1890/global_step=5570, RunningAvgSamplesPerSec=202.80489044331566, CurrSamplesPerSec=206.82801024449327, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,447] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,448] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,448] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,448] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,450] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:05:02,450] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-05-15 23:05:02,742] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5578 [2023-05-15 23:05:02,742] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5578 [2023-05-15 23:05:02,742] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-05-15 23:05:02,742] [INFO] 
[2023-05-15 23:05:02,742] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5578 (×8 ranks)
[2023-05-15 23:05:02,742] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (×8 ranks)
[2023-05-15 23:05:02,742] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:05:02,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=5580, skipped=47, lr=[7.2245095908783786e-06, 7.2245095908783786e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:02,899] [INFO] [timer.py:199:stop] epoch=1/micro_step=1900/global_step=5580, RunningAvgSamplesPerSec=202.81810269206144, CurrSamplesPerSec=206.58511275256004, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:04,416] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5589 (×8 ranks)
[2023-05-15 23:05:04,417] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 (×8 ranks)
[2023-05-15 23:05:04,417] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:05:04,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=5590, skipped=48, lr=[7.157107890048637e-06, 7.157107890048637e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:04,418] [INFO] [timer.py:199:stop] epoch=1/micro_step=1910/global_step=5590, RunningAvgSamplesPerSec=202.83294196885322, CurrSamplesPerSec=264.5700870285134, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:05,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=5600, skipped=48, lr=[7.082525993639916e-06, 7.082525993639916e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:05,972] [INFO] [timer.py:199:stop] epoch=1/micro_step=1920/global_step=5600, RunningAvgSamplesPerSec=202.83968884467066, CurrSamplesPerSec=206.27523809962915, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:07,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=5610, skipped=48, lr=[7.008270549912787e-06, 7.008270549912787e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:07,526] [INFO] [timer.py:199:stop] epoch=1/micro_step=1930/global_step=5610, RunningAvgSamplesPerSec=202.8463608833717, CurrSamplesPerSec=206.61087165340754, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:09,077] [INFO] [logging.py:96:log_dist] [Rank 0] step=5620, skipped=48, lr=[6.934342911786143e-06, 6.934342911786143e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:09,084] [INFO] [timer.py:199:stop] epoch=1/micro_step=1940/global_step=5620, RunningAvgSamplesPerSec=202.85230753766206, CurrSamplesPerSec=206.1722585937284, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:10,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=5630, skipped=48, lr=[6.860744426206292e-06, 6.860744426206292e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:10,653] [INFO] [timer.py:199:stop] epoch=1/micro_step=1950/global_step=5630, RunningAvgSamplesPerSec=202.85565782730254, CurrSamplesPerSec=206.2004489106772, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:12,199] [INFO] [logging.py:96:log_dist] [Rank 0] step=5640, skipped=48, lr=[6.787476434122461e-06, 6.787476434122461e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:12,206] [INFO] [timer.py:199:stop] epoch=1/micro_step=1960/global_step=5640, RunningAvgSamplesPerSec=202.8625887046263, CurrSamplesPerSec=207.10982314663505, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:13,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=5650, skipped=48, lr=[6.7145402704623635e-06, 6.7145402704623635e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:13,760] [INFO] [timer.py:199:stop] epoch=1/micro_step=1970/global_step=5650, RunningAvgSamplesPerSec=202.8691782876845, CurrSamplesPerSec=207.1254841806044, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:15,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=5660, skipped=48, lr=[6.641937264107867e-06, 6.641937264107867e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:15,313] [INFO] [timer.py:199:stop] epoch=1/micro_step=1980/global_step=5660, RunningAvgSamplesPerSec=202.87593908869806, CurrSamplesPerSec=206.8579742309352, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
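The fused_optimizer messages above trace DeepSpeed's dynamic loss scaling: after 100 consecutive overflow-free iterations the scale doubles, and a gradient overflow halves it and skips that optimizer step (the skipped counter in the [Rank 0] step lines climbs from 46 to 48 across the two overflows just logged). A minimal sketch of that policy, assuming the window of 100 and factor of 2 visible in the log; the class and method names are illustrative, not DeepSpeed's internal API:

# Minimal sketch of the dynamic loss-scale policy visible in the log above.
# Constants mirror the logged behavior (window of 100, factor of 2); names
# are illustrative, not DeepSpeed's actual implementation.
class DynamicLossScaler:
    def __init__(self, init_scale=8192.0, scale_factor=2.0, scale_window=100,
                 min_scale=1.0):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window   # "No Grad overflow for 100 iterations"
        self.min_scale = min_scale
        self.good_steps = 0                # overflow-free iterations since last change

    def update_scale(self, overflow: bool) -> bool:
        """Return True if the optimizer step should be applied, False if skipped."""
        if overflow:
            # "Grad overflow on iteration N" -> halve the scale and skip the step
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.scale_window:
            # "Increasing dynamic loss scale from X to 2X"
            self.cur_scale *= self.scale_factor
            self.good_steps = 0
        return True

A skipped step applies no update, so each overflow costs one wasted iteration while keeping fp16 gradients representable the rest of the time.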
[2023-05-15 23:05:16,860] [INFO] [logging.py:96:log_dist] [Rank 0] step=5670, skipped=48, lr=[6.569668737870763e-06, 6.569668737870763e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:16,866] [INFO] [timer.py:199:stop] epoch=1/micro_step=1990/global_step=5670, RunningAvgSamplesPerSec=202.88285124638003, CurrSamplesPerSec=206.9163332855423, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:18,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=5680, skipped=48, lr=[6.497736008468702e-06, 6.497736008468702e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:18,421] [INFO] [timer.py:199:stop] epoch=1/micro_step=2000/global_step=5680, RunningAvgSamplesPerSec=202.88924388442226, CurrSamplesPerSec=206.65636292809708, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:19,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=5690, skipped=48, lr=[6.426140386501189e-06, 6.426140386501189e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:20,003] [INFO] [timer.py:199:stop] epoch=1/micro_step=2010/global_step=5690, RunningAvgSamplesPerSec=202.8894897071596, CurrSamplesPerSec=190.0497972314733, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:20,148] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:05:20,148] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 (×8 ranks)
[2023-05-15 23:05:21,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=5700, skipped=48, lr=[6.35488317642568e-06, 6.35488317642568e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:21,567] [INFO] [timer.py:199:stop] epoch=1/micro_step=2020/global_step=5700, RunningAvgSamplesPerSec=202.89386666390783, CurrSamplesPerSec=201.8211547396896, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:23,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=5710, skipped=48, lr=[6.283965676533851e-06, 6.283965676533851e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:23,121] [INFO] [timer.py:199:stop] epoch=1/micro_step=2030/global_step=5710, RunningAvgSamplesPerSec=202.90043300208688, CurrSamplesPerSec=205.44263519615495, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:24,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=5720, skipped=48, lr=[6.2133891789279365e-06, 6.2133891789279365e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:24,673] [INFO] [timer.py:199:stop] epoch=1/micro_step=2040/global_step=5720, RunningAvgSamplesPerSec=202.9073019466751, CurrSamplesPerSec=206.38054190032614, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:26,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=5730, skipped=48, lr=[6.143154969497161e-06, 6.143154969497161e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:26,228] [INFO] [timer.py:199:stop] epoch=1/micro_step=2050/global_step=5730, RunningAvgSamplesPerSec=202.9136953547691, CurrSamplesPerSec=206.75345673230967, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:27,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=5740, skipped=48, lr=[6.073264327894332e-06, 6.073264327894332e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:27,782] [INFO] [timer.py:199:stop] epoch=1/micro_step=2060/global_step=5740, RunningAvgSamplesPerSec=202.92010467985574, CurrSamplesPerSec=205.27076715550953, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:29,341] [INFO] [logging.py:96:log_dist] [Rank 0] step=5750, skipped=48, lr=[6.003718527512531e-06, 6.003718527512531e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:29,347] [INFO] [timer.py:199:stop] epoch=1/micro_step=2070/global_step=5750, RunningAvgSamplesPerSec=202.92402202812804, CurrSamplesPerSec=201.8700261102889, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:30,894] [INFO] [logging.py:96:log_dist] [Rank 0] step=5760, skipped=48, lr=[5.934518835461908e-06, 5.934518835461908e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:30,901] [INFO] [timer.py:199:stop] epoch=1/micro_step=2080/global_step=5760, RunningAvgSamplesPerSec=202.93061224796963, CurrSamplesPerSec=206.8114381512082, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:32,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=5770, skipped=48, lr=[5.865666512546569e-06, 5.865666512546569e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:32,454] [INFO] [timer.py:199:stop] epoch=1/micro_step=2090/global_step=5770, RunningAvgSamplesPerSec=202.9370962572559, CurrSamplesPerSec=206.92271332039337, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:33,504] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5776 (×8 ranks)
[2023-05-15 23:05:33,505] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 (×8 ranks)
[2023-05-15 23:05:33,505] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:05:33,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=5780, skipped=49, lr=[5.803997459488275e-06, 5.803997459488275e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:33,971] [INFO] [timer.py:199:stop] epoch=1/micro_step=2100/global_step=5780, RunningAvgSamplesPerSec=202.95172376575255, CurrSamplesPerSec=207.29214911325585, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:35,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=5790, skipped=49, lr=[5.735808588759633e-06, 5.735808588759633e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:35,524] [INFO] [timer.py:199:stop] epoch=1/micro_step=2110/global_step=5790, RunningAvgSamplesPerSec=202.95833428046734, CurrSamplesPerSec=206.98717373368186, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:37,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=5800, skipped=49, lr=[5.6679707076259916e-06, 5.6679707076259916e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:37,076] [INFO] [timer.py:199:stop] epoch=1/micro_step=2120/global_step=5800, RunningAvgSamplesPerSec=202.96505597932682, CurrSamplesPerSec=206.6118258101362, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:38,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=5810, skipped=49, lr=[5.600485052079568e-06, 5.600485052079568e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:38,694] [INFO] [timer.py:199:stop] epoch=1/micro_step=2130/global_step=5810, RunningAvgSamplesPerSec=202.95724570058405, CurrSamplesPerSec=197.35782482174707, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:40,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=5820, skipped=49, lr=[5.533352851695093e-06, 5.533352851695093e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:40,335] [INFO] [timer.py:199:stop] epoch=1/micro_step=2140/global_step=5820, RunningAvgSamplesPerSec=202.9444398019723, CurrSamplesPerSec=195.4833977576268, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:41,914] [INFO] [logging.py:96:log_dist] [Rank 0] step=5830, skipped=49, lr=[5.466575329607398e-06, 5.466575329607398e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:41,920] [INFO] [timer.py:199:stop] epoch=1/micro_step=2150/global_step=5830, RunningAvgSamplesPerSec=202.94388672517542, CurrSamplesPerSec=202.68090581264005, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:43,499] [INFO] [logging.py:96:log_dist] [Rank 0] step=5840, skipped=49, lr=[5.400153702489177e-06, 5.400153702489177e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:43,506] [INFO] [timer.py:199:stop] epoch=1/micro_step=2160/global_step=5840, RunningAvgSamplesPerSec=202.9433131545231, CurrSamplesPerSec=202.40948273261952, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:45,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=5850, skipped=49, lr=[5.334089180528776e-06, 5.334089180528776e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:45,092] [INFO] [timer.py:199:stop] epoch=1/micro_step=2170/global_step=5850, RunningAvgSamplesPerSec=202.94258586360576, CurrSamplesPerSec=200.7933851656925, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:46,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=5860, skipped=49, lr=[5.2683829674081724e-06, 5.2683829674081724e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:46,653] [INFO] [timer.py:199:stop] epoch=1/micro_step=2180/global_step=5860, RunningAvgSamplesPerSec=202.94732483232943, CurrSamplesPerSec=205.08508378804154, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:48,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=5870, skipped=49, lr=[5.203036260281002e-06, 5.203036260281002e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:48,221] [INFO] [timer.py:199:stop] epoch=1/micro_step=2190/global_step=5870, RunningAvgSamplesPerSec=202.95052872209232, CurrSamplesPerSec=206.40561496750541, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:49,453] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:05:49,453] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 (×8 ranks)
[2023-05-15 23:05:49,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=5880, skipped=49, lr=[5.1380502497508086e-06, 5.1380502497508086e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
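The timer lines give a quick throughput sanity check: consecutive step=...0 entries are 10 micro-steps and roughly 1.55 s apart, and CurrSamplesPerSec hovers near 206. Assuming 8 ranks (one copy of each collapsed message per rank) and 4 samples per device per micro-step, i.e. 32 samples per global micro-step, the numbers are self-consistent:

# Back-of-the-envelope check of the timer lines above. The 8-rank and
# 4-samples-per-device figures are assumptions, not read from this log.
interval_s = 1.55            # wall time between consecutive step=...0 lines
steps_per_interval = 10      # step lines are emitted every 10 micro-steps
samples_per_step = 8 * 4     # ranks * per-device micro-batch (assumed)

step_time = interval_s / steps_per_interval    # ~0.155 s per micro-step
throughput = samples_per_step / step_time      # ~206, matching CurrSamplesPerSec
print(f"{step_time * 1e3:.0f} ms/step -> {throughput:.0f} samples/s")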
[2023-05-15 23:05:49,785] [INFO] [timer.py:199:stop] epoch=1/micro_step=2200/global_step=5880, RunningAvgSamplesPerSec=202.9548357721437, CurrSamplesPerSec=206.3243876418485, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:51,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=5890, skipped=49, lr=[5.073426119849295e-06, 5.073426119849295e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:51,349] [INFO] [timer.py:199:stop] epoch=1/micro_step=2210/global_step=5890, RunningAvgSamplesPerSec=202.95883642686726, CurrSamplesPerSec=206.25621875638703, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:52,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=5900, skipped=49, lr=[5.00916504801478e-06, 5.00916504801478e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:52,901] [INFO] [timer.py:199:stop] epoch=1/micro_step=2220/global_step=5900, RunningAvgSamplesPerSec=202.96542803805391, CurrSamplesPerSec=207.1085447994383, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:54,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=5910, skipped=49, lr=[4.945268205070741e-06, 4.945268205070741e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:54,453] [INFO] [timer.py:199:stop] epoch=1/micro_step=2230/global_step=5910, RunningAvgSamplesPerSec=202.9720038397123, CurrSamplesPerSec=206.9252654444991, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:56,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=5920, skipped=49, lr=[4.881736755204491e-06, 4.881736755204491e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:56,017] [INFO] [timer.py:199:stop] epoch=1/micro_step=2240/global_step=5920, RunningAvgSamplesPerSec=202.97608108436114, CurrSamplesPerSec=206.9903658865713, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:57,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=5930, skipped=49, lr=[4.818571855945933e-06, 4.818571855945933e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:57,569] [INFO] [timer.py:199:stop] epoch=1/micro_step=2250/global_step=5930, RunningAvgSamplesPerSec=202.98250079631595, CurrSamplesPerSec=206.9488755772448, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:05:59,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=5940, skipped=49, lr=[4.755774658146508e-06, 4.755774658146508e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:05:59,124] [INFO] [timer.py:199:stop] epoch=1/micro_step=2260/global_step=5940, RunningAvgSamplesPerSec=202.98855925641718, CurrSamplesPerSec=205.20893993462315, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:00,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=5950, skipped=49, lr=[4.693346305958218e-06, 4.693346305958218e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:00,729] [INFO] [timer.py:199:stop] epoch=1/micro_step=2270/global_step=5950, RunningAvgSamplesPerSec=202.98391856554778, CurrSamplesPerSec=197.41617245599173, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:01,501] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5954 (×8 ranks)
[2023-05-15 23:06:01,501] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 (×8 ranks)
[2023-05-15 23:06:01,501] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:06:02,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=5960, skipped=50, lr=[4.637477092212775e-06, 4.637477092212775e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:02,315] [INFO] [timer.py:199:stop] epoch=1/micro_step=2280/global_step=5960, RunningAvgSamplesPerSec=202.98351979601296, CurrSamplesPerSec=197.87602979247876, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:03,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=5970, skipped=50, lr=[4.575752674738951e-06, 4.575752674738951e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:03,942] [INFO] [timer.py:199:stop] epoch=1/micro_step=2290/global_step=5970, RunningAvgSamplesPerSec=202.97418546535417, CurrSamplesPerSec=196.9489101007359, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:05,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=5980, skipped=50, lr=[4.514400382839673e-06, 4.514400382839673e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:05,549] [INFO] [timer.py:199:stop] epoch=1/micro_step=2300/global_step=5980, RunningAvgSamplesPerSec=202.9690884000143, CurrSamplesPerSec=200.36204909565353, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:07,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=5990, skipped=50, lr=[4.453421334341051e-06, 4.453421334341051e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:07,150] [INFO] [timer.py:199:stop] epoch=1/micro_step=2310/global_step=5990, RunningAvgSamplesPerSec=202.96524538198574, CurrSamplesPerSec=200.47727004545243, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:08,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=6000, skipped=50, lr=[4.3928166402687685e-06, 4.3928166402687685e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:08,756] [INFO] [timer.py:199:stop] epoch=1/micro_step=2320/global_step=6000, RunningAvgSamplesPerSec=202.96029645839013, CurrSamplesPerSec=198.34243095506557, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:10,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=6010, skipped=50, lr=[4.332587404827854e-06, 4.332587404827854e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
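Note that global_step keeps advancing through overflows; skipped counts the optimizer steps whose update was dropped, and it increments by exactly one per overflow event (46 at step 5480, 50 by step 5960). The number of parameter updates actually applied at any log line is therefore the difference:

# Updates actually applied = attempted steps minus overflow-skipped steps,
# using the counters from the step=6000 line above.
global_step, skipped = 6000, 50
applied_updates = global_step - skipped
print(applied_updates)  # 5950 optimizer updates have actually modified the weights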
[2023-05-15 23:06:10,367] [INFO] [timer.py:199:stop] epoch=1/micro_step=2330/global_step=6010, RunningAvgSamplesPerSec=202.95421625652412, CurrSamplesPerSec=200.73362487212773, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:11,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=6020, skipped=50, lr=[4.272734725382565e-06, 4.272734725382565e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:11,956] [INFO] [timer.py:199:stop] epoch=1/micro_step=2340/global_step=6020, RunningAvgSamplesPerSec=202.95298177146591, CurrSamplesPerSec=202.1622295759522, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:13,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=6030, skipped=50, lr=[4.213259692436367e-06, 4.213259692436367e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:13,543] [INFO] [timer.py:199:stop] epoch=1/micro_step=2350/global_step=6030, RunningAvgSamplesPerSec=202.95204662428722, CurrSamplesPerSec=202.25697408077156, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:15,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=6040, skipped=50, lr=[4.154163389612109e-06, 4.154163389612109e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:15,130] [INFO] [timer.py:199:stop] epoch=1/micro_step=2360/global_step=6040, RunningAvgSamplesPerSec=202.95100042967937, CurrSamplesPerSec=201.76533071163246, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:16,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=6050, skipped=50, lr=[4.095446893632235e-06, 4.095446893632235e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:16,719] [INFO] [timer.py:199:stop] epoch=1/micro_step=2370/global_step=6050, RunningAvgSamplesPerSec=202.94954269828665, CurrSamplesPerSec=202.1412211211181, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:17,656] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:06:17,657] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 (×8 ranks)
[2023-05-15 23:06:18,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=6060, skipped=50, lr=[4.0371112742991855e-06, 4.0371112742991855e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:18,314] [INFO] [timer.py:199:stop] epoch=1/micro_step=2380/global_step=6060, RunningAvgSamplesPerSec=202.9469021776423, CurrSamplesPerSec=201.07072949671542, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:19,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=6070, skipped=50, lr=[3.979157594475924e-06, 3.979157594475924e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:19,925] [INFO] [timer.py:199:stop] epoch=1/micro_step=2390/global_step=6070, RunningAvgSamplesPerSec=202.94084063640838, CurrSamplesPerSec=200.0061513885329, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:21,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=6080, skipped=50, lr=[3.921586910066552e-06, 3.921586910066552e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:21,525] [INFO] [timer.py:199:stop] epoch=1/micro_step=2400/global_step=6080, RunningAvgSamplesPerSec=202.93740767340768, CurrSamplesPerSec=200.72942196964033, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:23,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=6090, skipped=50, lr=[3.864400269997057e-06, 3.864400269997057e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:23,122] [INFO] [timer.py:199:stop] epoch=1/micro_step=2410/global_step=6090, RunningAvgSamplesPerSec=202.93438556991566, CurrSamplesPerSec=201.77382368018712, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:24,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=6100, skipped=50, lr=[3.807598716196231e-06, 3.807598716196231e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:24,717] [INFO] [timer.py:199:stop] epoch=1/micro_step=2420/global_step=6100, RunningAvgSamplesPerSec=202.93173347523063, CurrSamplesPerSec=200.99485000007488, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:26,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=6110, skipped=50, lr=[3.751183283576673e-06, 3.751183283576673e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:26,313] [INFO] [timer.py:199:stop] epoch=1/micro_step=2430/global_step=6110, RunningAvgSamplesPerSec=202.9288283251422, CurrSamplesPerSec=200.35307640176384, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:27,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=6120, skipped=50, lr=[3.6951550000159374e-06, 3.6951550000159374e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:27,907] [INFO] [timer.py:199:stop] epoch=1/micro_step=2440/global_step=6120, RunningAvgSamplesPerSec=202.92643584333246, CurrSamplesPerSec=201.59834657117796, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:29,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=6130, skipped=50, lr=[3.6395148863377858e-06, 3.6395148863377858e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:29,502] [INFO] [timer.py:199:stop] epoch=1/micro_step=2450/global_step=6130, RunningAvgSamplesPerSec=202.92398749268114, CurrSamplesPerSec=206.39767979369108, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:31,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=6140, skipped=50, lr=[3.5842639562936263e-06, 3.5842639562936263e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
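The lr column decays smoothly, with identical values for both parameter groups, consistent with a zero-warmup cosine schedule. A sketch of the standard formula follows; the base_lr of 5e-5 and total of ~7400 steps are back-solved assumptions that reproduce the logged values to within a few percent, not settings read from this run:

import math

# Standard cosine decay with no warmup; constants are illustrative back-solves.
def cosine_lr(step, base_lr=5e-5, total_steps=7400):
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(5480))  # ~7.9e-06 vs 7.983e-06 logged at step=5480
print(cosine_lr(6140))  # ~3.5e-06 vs 3.584e-06 logged at step=6140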
[2023-05-15 23:06:31,056] [INFO] [timer.py:199:stop] epoch=1/micro_step=2460/global_step=6140, RunningAvgSamplesPerSec=202.9298904809933, CurrSamplesPerSec=206.63218325330345, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:32,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=6150, skipped=50, lr=[3.5294032165440055e-06, 3.5294032165440055e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:32,611] [INFO] [timer.py:199:stop] epoch=1/micro_step=2470/global_step=6150, RunningAvgSamplesPerSec=202.93582137228242, CurrSamplesPerSec=207.03538572600897, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:33,524] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:06:33,524] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 (×8 ranks)
[2023-05-15 23:06:33,662] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6156 (×8 ranks)
[2023-05-15 23:06:33,662] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 (×8 ranks)
[2023-05-15 23:06:33,663] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-05-15 23:06:34,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=6160, skipped=51, lr=[3.48036298974756e-06, 3.48036298974756e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:34,129] [INFO] [timer.py:199:stop] epoch=1/micro_step=2480/global_step=6160, RunningAvgSamplesPerSec=202.94920626696543, CurrSamplesPerSec=206.4522861339917, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:35,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=6170, skipped=51, lr=[3.4262463594311483e-06, 3.4262463594311483e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:35,686] [INFO] [timer.py:199:stop] epoch=1/micro_step=2490/global_step=6170, RunningAvgSamplesPerSec=202.95444573116927, CurrSamplesPerSec=204.8434467149252, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:37,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=6180, skipped=51, lr=[3.3725227984573116e-06, 3.3725227984573116e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:37,252] [INFO] [timer.py:199:stop] epoch=1/micro_step=2500/global_step=6180, RunningAvgSamplesPerSec=202.95800759143418, CurrSamplesPerSec=206.92111827479565, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:38,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=6190, skipped=51, lr=[3.3191932856582454e-06, 3.3191932856582454e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:38,807] [INFO] [timer.py:199:stop] epoch=1/micro_step=2510/global_step=6190, RunningAvgSamplesPerSec=202.96370862239775, CurrSamplesPerSec=206.40498013114674, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:40,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=6200, skipped=51, lr=[3.266258792686672e-06, 3.266258792686672e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:40,366] [INFO] [timer.py:199:stop] epoch=1/micro_step=2520/global_step=6200, RunningAvgSamplesPerSec=202.96851092686154, CurrSamplesPerSec=207.21854036754047, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:41,417] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6206 (×8 ranks)
[2023-05-15 23:06:41,417] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 (×8 ranks)
[2023-05-15 23:06:41,417] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:06:41,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=6210, skipped=52, lr=[3.2189562882362716e-06, 3.2189562882362716e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:41,884] [INFO] [timer.py:199:stop] epoch=1/micro_step=2530/global_step=6210, RunningAvgSamplesPerSec=202.98193923974026, CurrSamplesPerSec=206.88188789469223, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:43,427] [INFO] [logging.py:96:log_dist] [Rank 0] step=6220, skipped=52, lr=[3.166774984049342e-06, 3.166774984049342e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:43,434] [INFO] [timer.py:199:stop] epoch=1/micro_step=2540/global_step=6220, RunningAvgSamplesPerSec=202.98857825382566, CurrSamplesPerSec=207.4372793758539, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:44,985] [INFO] [logging.py:96:log_dist] [Rank 0] step=6230, skipped=52, lr=[3.1149914767199334e-06, 3.1149914767199334e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:44,991] [INFO] [timer.py:199:stop] epoch=1/micro_step=2550/global_step=6230, RunningAvgSamplesPerSec=202.99379475237353, CurrSamplesPerSec=206.42561431192604, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:46,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=6240, skipped=52, lr=[3.0636067097328773e-06, 3.0636067097328773e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:46,554] [INFO] [timer.py:199:stop] epoch=1/micro_step=2560/global_step=6240, RunningAvgSamplesPerSec=202.99779262662125, CurrSamplesPerSec=205.12144122335263, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:48,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=6250, skipped=52, lr=[3.0126216193080285e-06, 3.0126216193080285e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:48,112] [INFO] [timer.py:199:stop] epoch=1/micro_step=2570/global_step=6250, RunningAvgSamplesPerSec=203.00280433401414, CurrSamplesPerSec=205.80490677134446, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:49,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=6260, skipped=52, lr=[2.962037134383211e-06, 2.962037134383211e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:49,688] [INFO] [timer.py:199:stop] epoch=1/micro_step=2580/global_step=6260, RunningAvgSamplesPerSec=203.00412603528173, CurrSamplesPerSec=199.34460823264692, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:51,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=6270, skipped=52, lr=[2.9118541765973202e-06, 2.9118541765973202e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:51,300] [INFO] [timer.py:199:stop] epoch=1/micro_step=2590/global_step=6270, RunningAvgSamplesPerSec=202.99829126810465, CurrSamplesPerSec=199.47763381219477, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:52,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=6280, skipped=52, lr=[2.8620736602734983e-06, 2.8620736602734983e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:52,899] [INFO] [timer.py:199:stop] epoch=1/micro_step=2600/global_step=6280, RunningAvgSamplesPerSec=202.9947571776134, CurrSamplesPerSec=200.88835405562764, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:54,473] [INFO] [logging.py:96:log_dist] [Rank 0] step=6290, skipped=52, lr=[2.8126964924024806e-06, 2.8126964924024806e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:54,479] [INFO] [timer.py:199:stop] epoch=1/micro_step=2610/global_step=6290, RunningAvgSamplesPerSec=202.99520864425256, CurrSamplesPerSec=207.62371547860846, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:56,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=6300, skipped=52, lr=[2.763723572626087e-06, 2.763723572626087e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:56,037] [INFO] [timer.py:199:stop] epoch=1/micro_step=2620/global_step=6300, RunningAvgSamplesPerSec=203.00027809154022, CurrSamplesPerSec=206.79773631070415, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:57,260] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:06:57,260] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 (×8 ranks)
[2023-05-15 23:06:57,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6308 (×8 ranks)
[2023-05-15 23:06:57,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0 (×8 ranks)
[2023-05-15 23:06:57,398] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:06:57,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=6310, skipped=53, lr=[2.7199943145672204e-06, 2.7199943145672204e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:57,555] [INFO] [timer.py:199:stop] epoch=1/micro_step=2630/global_step=6310, RunningAvgSamplesPerSec=203.01336661346838, CurrSamplesPerSec=206.7066493150474, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:06:59,102] [INFO] [logging.py:96:log_dist] [Rank 0] step=6320, skipped=53, lr=[2.6717919182918326e-06, 2.6717919182918326e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:06:59,109] [INFO] [timer.py:199:stop] epoch=1/micro_step=2640/global_step=6320, RunningAvgSamplesPerSec=203.01910783583747, CurrSamplesPerSec=205.56283081748677, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:00,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=6330, skipped=53, lr=[2.6239963373633546e-06, 2.6239963373633546e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:00,668] [INFO] [timer.py:199:stop] epoch=1/micro_step=2650/global_step=6330, RunningAvgSamplesPerSec=203.02364960262673, CurrSamplesPerSec=206.84968545123337, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:02,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=6340, skipped=53, lr=[2.57660844260742e-06, 2.57660844260742e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:02,219] [INFO] [timer.py:199:stop] epoch=1/micro_step=2660/global_step=6340, RunningAvgSamplesPerSec=203.029897793278, CurrSamplesPerSec=207.64748527552976, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:03,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=6350, skipped=53, lr=[2.5296290974216876e-06, 2.5296290974216876e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:03,773] [INFO] [timer.py:199:stop] epoch=1/micro_step=2670/global_step=6350, RunningAvgSamplesPerSec=203.03572006465535, CurrSamplesPerSec=206.62645731695113, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:05,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=6360, skipped=53, lr=[2.4830591577601426e-06, 2.4830591577601426e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:05,331] [INFO] [timer.py:199:stop] epoch=1/micro_step=2680/global_step=6360, RunningAvgSamplesPerSec=203.04060588320797, CurrSamplesPerSec=203.4199870871135, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:06,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=6370, skipped=53, lr=[2.4368994721174904e-06, 2.4368994721174904e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:06,916] [INFO] [timer.py:199:stop] epoch=1/micro_step=2690/global_step=6370, RunningAvgSamplesPerSec=203.0399559071791, CurrSamplesPerSec=202.7354002966615, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:08,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=6380, skipped=53, lr=[2.3911508815136764e-06, 2.3911508815136764e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:08,467] [INFO] [timer.py:199:stop] epoch=1/micro_step=2700/global_step=6380, RunningAvgSamplesPerSec=203.0461553161323, CurrSamplesPerSec=207.27486236265221, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:10,017] [INFO] [logging.py:96:log_dist] [Rank 0] step=6390, skipped=53, lr=[2.3458142194785927e-06, 2.3458142194785927e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:10,023] [INFO] [timer.py:199:stop] epoch=1/micro_step=2710/global_step=6390, RunningAvgSamplesPerSec=203.05134026537368, CurrSamplesPerSec=205.93848131991894, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:11,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=6400, skipped=53, lr=[2.3008903120368657e-06, 2.3008903120368657e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:11,575] [INFO] [timer.py:199:stop] epoch=1/micro_step=2720/global_step=6400, RunningAvgSamplesPerSec=203.0573717727836, CurrSamplesPerSec=207.58710779029357, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:13,107] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations (×8 ranks)
[2023-05-15 23:07:13,107] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 (×8 ranks)
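These scale cycles repeat for the rest of the run: roughly every 100 clean iterations the scale is pushed from 8192 back up to 16384 (occasionally 32768), and an overflow knocks it down again within a couple hundred iterations. To chart that sawtooth from a saved log, a small parser over the exact message format shown above; scale_events is a hypothetical helper, and it also collapses the per-rank duplicates:

import re

# Extract loss-scale events from a saved log in the format shown above.
# Each event is logged once per rank, so consecutive duplicates are dropped.
OVERFLOW_RE = re.compile(r"Grad overflow on iteration (\d+)")
INCREASE_RE = re.compile(r"Increasing dynamic loss scale from [\d.]+ to ([\d.]+)")

def scale_events(log_path):
    events = []
    with open(log_path) as f:
        for line in f:
            for kind, rex in (("overflow", OVERFLOW_RE), ("increase", INCREASE_RE)):
                m = rex.search(line)
                if m:
                    event = (kind, float(m.group(1)))
                    if not events or events[-1] != event:
                        events.append(event)
    return events

# e.g. scale_events("training.log") -> [("increase", 16384.0), ("overflow", 5578.0), ...]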
[fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-05-15 23:07:13,108] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-05-15 23:07:13,108] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:07:13,108] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-05-15 23:07:13,108] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-05-15 23:07:13,108] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0 [2023-05-15 23:07:13,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=6410, skipped=53, lr=[2.2563799776928325e-06, 2.2563799776928325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:13,126] [INFO] [timer.py:199:stop] epoch=1/micro_step=2730/global_step=6410, RunningAvgSamplesPerSec=203.06349465082403, CurrSamplesPerSec=206.3780031952075, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:14,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=6420, skipped=53, lr=[2.212284027415598e-06, 2.212284027415598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:14,677] [INFO] [timer.py:199:stop] epoch=1/micro_step=2740/global_step=6420, RunningAvgSamplesPerSec=203.06950498728665, CurrSamplesPerSec=207.2284584980237, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:16,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=6430, skipped=53, lr=[2.1686032646242915e-06, 2.1686032646242915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:16,230] [INFO] [timer.py:199:stop] epoch=1/micro_step=2750/global_step=6430, RunningAvgSamplesPerSec=203.07537264928055, CurrSamplesPerSec=206.62582112138972, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:17,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=6440, skipped=53, lr=[2.1253384851734033e-06, 2.1253384851734033e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:17,783] [INFO] [timer.py:199:stop] epoch=1/micro_step=2760/global_step=6440, RunningAvgSamplesPerSec=203.08094297002228, CurrSamplesPerSec=205.747803679364, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:19,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=6450, skipped=53, lr=[2.082490477338284e-06, 2.082490477338284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:19,380] [INFO] [timer.py:199:stop] epoch=1/micro_step=2770/global_step=6450, RunningAvgSamplesPerSec=203.07798135973354, CurrSamplesPerSec=206.43990193092714, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:20,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=6460, skipped=53, lr=[2.0400600218008077e-06, 2.0400600218008077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:20,933] [INFO] [timer.py:199:stop] epoch=1/micro_step=2780/global_step=6460, RunningAvgSamplesPerSec=203.08378366783765, CurrSamplesPerSec=206.26160721420712, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:22,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=6470, skipped=53, lr=[1.99804789163513e-06, 1.99804789163513e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-05-15 23:07:22,483] [INFO] [timer.py:199:stop] epoch=1/micro_step=2790/global_step=6470, RunningAvgSamplesPerSec=203.08995819394767, CurrSamplesPerSec=207.3177644149453, MemAllocated=4.34GB, MaxMemAllocated=12.81GB [2023-05-15 23:07:23,844] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6478 
[2023-05-15 23:07:23,844] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6478
[2023-05-15 23:07:23,844] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:07:23,844] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:07:23,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=6480, skipped=54, lr=[1.9605952754829738e-06, 1.9605952754829738e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:24,001] [INFO] [timer.py:199:stop] epoch=1/micro_step=2800/global_step=6480, RunningAvgSamplesPerSec=203.10263417747015, CurrSamplesPerSec=206.79455009637326, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:25,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=6490, skipped=54, lr=[1.919380066034257e-06, 1.919380066034257e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:25,553] [INFO] [timer.py:199:stop] epoch=1/micro_step=2810/global_step=6490, RunningAvgSamplesPerSec=203.1084075412814, CurrSamplesPerSec=207.55500640985, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:27,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=6500, skipped=54, lr=[1.8785853807212428e-06, 1.8785853807212428e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:27,105] [INFO] [timer.py:199:stop] epoch=1/micro_step=2820/global_step=6500, RunningAvgSamplesPerSec=203.1142135089828, CurrSamplesPerSec=207.168644181599, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:28,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=6510, skipped=54, lr=[1.8382119628146788e-06, 1.8382119628146788e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:28,659] [INFO] [timer.py:199:stop] epoch=1/micro_step=2830/global_step=6510, RunningAvgSamplesPerSec=203.11968616343154, CurrSamplesPerSec=206.62295828984355, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
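
The fused_optimizer messages above show DeepSpeed's dynamic fp16 loss scaling at work (each of the eight ranks prints the same pair of messages): after 100 consecutive overflow-free iterations the scale doubles (8192.0 -> 16384.0), and the first gradient overflow halves it again and skips that optimizer step, which is why the skipped counter ticks from 53 to 54 at step 6480. A minimal sketch of this policy follows, with illustrative names rather than DeepSpeed's actual internals:

    class DynamicLossScaler:
        # Illustrative sketch of the policy behind the log messages above;
        # not DeepSpeed's actual fused_optimizer implementation.
        def __init__(self, init_scale=2.0**16, factor=2.0, window=100):
            self.scale = init_scale   # loss is multiplied by this before backward()
            self.factor = factor      # grow/shrink multiplier (powers of two)
            self.window = window      # clean steps required before growing
            self.clean_steps = 0
            self.skipped = 0          # mirrors the log's skipped=NN counter

        def update(self, found_overflow):
            # Returns True if the optimizer step should be applied.
            if found_overflow:
                # Inf/NaN in the fp16 grads: halve the scale and skip the step
                # ("Overflow detected. Skipping step.").
                self.scale /= self.factor
                self.clean_steps = 0
                self.skipped += 1
                return False
            self.clean_steps += 1
            if self.clean_steps >= self.window:
                # "No Grad overflow for 100 iterations": double the scale.
                self.scale *= self.factor
                self.clean_steps = 0
            return True

Scaling the loss by a large power of two keeps small fp16 gradients above the underflow threshold, while backing off on overflow keeps them finite; the 8192/16384 oscillation in this log is the scaler probing that boundary.
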
[2023-05-15 23:07:30,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=6520, skipped=54, lr=[1.7982605479099191e-06, 1.7982605479099191e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:30,216] [INFO] [timer.py:199:stop] epoch=1/micro_step=2840/global_step=6520, RunningAvgSamplesPerSec=203.12447175505403, CurrSamplesPerSec=204.78031439238475, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:31,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=6530, skipped=54, lr=[1.7587318639135069e-06, 1.7587318639135069e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:31,768] [INFO] [timer.py:199:stop] epoch=1/micro_step=2850/global_step=6530, RunningAvgSamplesPerSec=203.13018726649597, CurrSamplesPerSec=207.19934668270113, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:33,313] [INFO] [logging.py:96:log_dist] [Rank 0] step=6540, skipped=54, lr=[1.7196266310299108e-06, 1.7196266310299108e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:33,319] [INFO] [timer.py:199:stop] epoch=1/micro_step=2860/global_step=6540, RunningAvgSamplesPerSec=203.1360779420141, CurrSamplesPerSec=207.22941836748873, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:34,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=6550, skipped=54, lr=[1.6809455617484121e-06, 1.6809455617484121e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:34,872] [INFO] [timer.py:199:stop] epoch=1/micro_step=2870/global_step=6550, RunningAvgSamplesPerSec=203.14165961076378, CurrSamplesPerSec=206.3780031952075, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:36,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=6560, skipped=54, lr=[1.6426893608301102e-06, 1.6426893608301102e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:36,428] [INFO] [timer.py:199:stop] epoch=1/micro_step=2880/global_step=6560, RunningAvgSamplesPerSec=203.14656858654612, CurrSamplesPerSec=202.95766124459217, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:37,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=6570, skipped=54, lr=[1.6048587252951024e-06, 1.6048587252951024e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:37,979] [INFO] [timer.py:199:stop] epoch=1/micro_step=2890/global_step=6570, RunningAvgSamplesPerSec=203.15236008827046, CurrSamplesPerSec=206.51931362170413, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:39,514] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:07:39,514] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:07:39,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=6580, skipped=54, lr=[1.5674543444097634e-06, 1.5674543444097634e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:39,533] [INFO] [timer.py:199:stop] epoch=1/micro_step=2900/global_step=6580, RunningAvgSamplesPerSec=203.15772185169018, CurrSamplesPerSec=206.05198210562838, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:40,277] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6584
[2023-05-15 23:07:40,278] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:07:40,278] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:07:41,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=6590, skipped=55, lr=[1.534155412758939e-06, 1.534155412758939e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:41,054] [INFO] [timer.py:199:stop] epoch=1/micro_step=2910/global_step=6590, RunningAvgSamplesPerSec=203.16934746459043, CurrSamplesPerSec=206.79837356535245, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:42,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=6600, skipped=55, lr=[1.4975627868119374e-06, 1.4975627868119374e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:42,606] [INFO] [timer.py:199:stop] epoch=1/micro_step=2920/global_step=6600, RunningAvgSamplesPerSec=203.1750381422306, CurrSamplesPerSec=207.01143656523624, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:44,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=6610, skipped=55, lr=[1.4613983704244826e-06, 1.4613983704244826e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:44,157] [INFO] [timer.py:199:stop] epoch=1/micro_step=2930/global_step=6610, RunningAvgSamplesPerSec=203.1807295081174, CurrSamplesPerSec=206.6305926835952, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:45,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=6620, skipped=55, lr=[1.425662822504778e-06, 1.425662822504778e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:45,719] [INFO] [timer.py:199:stop] epoch=1/micro_step=2940/global_step=6620, RunningAvgSamplesPerSec=203.18447105989972, CurrSamplesPerSec=206.13394479350166, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:47,270] [INFO] [logging.py:96:log_dist] [Rank 0] step=6630, skipped=55, lr=[1.3903567941471462e-06, 1.3903567941471462e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:47,276] [INFO] [timer.py:199:stop] epoch=1/micro_step=2950/global_step=6630, RunningAvgSamplesPerSec=203.18931398745772, CurrSamplesPerSec=205.67560261734374, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:48,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=6640, skipped=55, lr=[1.3554809286201343e-06, 1.3554809286201343e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:48,834] [INFO] [timer.py:199:stop] epoch=1/micro_step=2960/global_step=6640, RunningAvgSamplesPerSec=203.19383873529316, CurrSamplesPerSec=205.619201255, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:50,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=6650, skipped=55, lr=[1.3210358613548218e-06, 1.3210358613548218e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:50,398] [INFO] [timer.py:199:stop] epoch=1/micro_step=2970/global_step=6650, RunningAvgSamplesPerSec=203.1970202967453, CurrSamplesPerSec=206.59433433692286, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:51,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=6660, skipped=55, lr=[1.2870222199332399e-06, 1.2870222199332399e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:51,952] [INFO] [timer.py:199:stop] epoch=1/micro_step=2980/global_step=6660, RunningAvgSamplesPerSec=203.20235133337692, CurrSamplesPerSec=207.46068582822608, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:53,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=6670, skipped=55, lr=[1.2534406240769097e-06, 1.2534406240769097e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:53,504] [INFO] [timer.py:199:stop] epoch=1/micro_step=2990/global_step=6670, RunningAvgSamplesPerSec=203.20776018377617, CurrSamplesPerSec=207.2892677879195, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:55,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=6680, skipped=55, lr=[1.220291685635591e-06, 1.220291685635591e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:55,066] [INFO] [timer.py:199:stop] epoch=1/micro_step=3000/global_step=6680, RunningAvgSamplesPerSec=203.2114570012797, CurrSamplesPerSec=202.06209803685414, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:55,983] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:07:55,983] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:07:56,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=6690, skipped=55, lr=[1.187576008576105e-06, 1.187576008576105e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:56,623] [INFO] [timer.py:199:stop] epoch=1/micro_step=3010/global_step=6690, RunningAvgSamplesPerSec=203.21585087257654, CurrSamplesPerSec=207.13347541656572, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
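
The throughput numbers are easy to sanity-check against the run configuration: 8 GPUs x per_device_train_batch_size 4 x gradient_accumulation_steps 1 = 32 samples per optimizer step, and consecutive step=... lines sit about 0.155 s apart (e.g. 23:07:51,945 -> 23:07:53,498 for steps 6660 -> 6670). A quick back-of-the-envelope check:

    # Rough consistency check of the logged SamplesPerSec values.
    world_size = 8                       # GPUs 0-7 on localhost
    per_device_batch = 4                 # --per_device_train_batch_size
    samples_per_step = world_size * per_device_batch   # 32 samples per step

    seconds_per_step = 1.55 / 10         # ten steps take ~1.55 s in the timestamps
    print(samples_per_step / seconds_per_step)   # ~206, matching CurrSamplesPerSec
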
[2023-05-15 23:07:57,519] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6695
[2023-05-15 23:07:57,520] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:07:57,520] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:07:58,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=6700, skipped=56, lr=[1.15850283052156e-06, 1.15850283052156e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:58,148] [INFO] [timer.py:199:stop] epoch=1/micro_step=3020/global_step=6700, RunningAvgSamplesPerSec=203.22662622425716, CurrSamplesPerSec=206.29426095082638, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:07:59,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=6710, skipped=56, lr=[1.1266119857352709e-06, 1.1266119857352709e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:07:59,710] [INFO] [timer.py:199:stop] epoch=1/micro_step=3030/global_step=6710, RunningAvgSamplesPerSec=203.2302327298697, CurrSamplesPerSec=206.94951376371708, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:01,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=6720, skipped=56, lr=[1.095156109155629e-06, 1.095156109155629e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:01,266] [INFO] [timer.py:199:stop] epoch=1/micro_step=3040/global_step=6720, RunningAvgSamplesPerSec=203.2348427304696, CurrSamplesPerSec=206.29838303104827, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:02,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=6730, skipped=56, lr=[1.0641357739022224e-06, 1.0641357739022224e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:02,819] [INFO] [timer.py:199:stop] epoch=1/micro_step=3050/global_step=6730, RunningAvgSamplesPerSec=203.24004253527988, CurrSamplesPerSec=206.57398435047065, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:04,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=6740, skipped=56, lr=[1.0335515451591503e-06, 1.0335515451591503e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:04,371] [INFO] [timer.py:199:stop] epoch=1/micro_step=3060/global_step=6740, RunningAvgSamplesPerSec=203.24547711686793, CurrSamplesPerSec=207.27486236265221, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:05,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=6750, skipped=56, lr=[1.0034039801647687e-06, 1.0034039801647687e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:05,924] [INFO] [timer.py:199:stop] epoch=1/micro_step=3070/global_step=6750, RunningAvgSamplesPerSec=203.25059830773006, CurrSamplesPerSec=206.95972528260458, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:07,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=6760, skipped=56, lr=[9.73693628201483e-07, 9.73693628201483e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:07,478] [INFO] [timer.py:199:stop] epoch=1/micro_step=3080/global_step=6760, RunningAvgSamplesPerSec=203.25566966524272, CurrSamplesPerSec=206.21755690591607, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:09,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=6770, skipped=56, lr=[9.444210305857848e-07, 9.444210305857848e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:09,036] [INFO] [timer.py:199:stop] epoch=1/micro_step=3090/global_step=6770, RunningAvgSamplesPerSec=203.25994441549324, CurrSamplesPerSec=205.82258044729062, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:10,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=6780, skipped=56, lr=[9.155867206583624e-07, 9.155867206583624e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:10,595] [INFO] [timer.py:199:stop] epoch=1/micro_step=3100/global_step=6780, RunningAvgSamplesPerSec=203.26394672399718, CurrSamplesPerSec=206.118749875224, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:12,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=6790, skipped=56, lr=[8.871912237744029e-07, 8.871912237744029e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:12,148] [INFO] [timer.py:199:stop] epoch=1/micro_step=3110/global_step=6790, RunningAvgSamplesPerSec=203.26904242349568, CurrSamplesPerSec=206.39133209391872, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:12,743] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6793
[2023-05-15 23:08:12,743] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 23:08:12,743] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 23:08:13,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=6800, skipped=57, lr=[8.620108894944661e-07, 8.620108894944661e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:13,675] [INFO] [timer.py:199:stop] epoch=1/micro_step=3120/global_step=6800, RunningAvgSamplesPerSec=203.27909823441738, CurrSamplesPerSec=207.06988225420255, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:15,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=6810, skipped=57, lr=[8.344505561046157e-07, 8.344505561046157e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:15,228] [INFO] [timer.py:199:stop] epoch=1/micro_step=3130/global_step=6810, RunningAvgSamplesPerSec=203.2841711662123, CurrSamplesPerSec=206.7416732773161, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:16,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=6820, skipped=57, lr=[8.073305140424093e-07, 8.073305140424093e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:16,805] [INFO] [timer.py:199:stop] epoch=1/micro_step=3140/global_step=6820, RunningAvgSamplesPerSec=203.28476934847103, CurrSamplesPerSec=194.80670383277285, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:18,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=6830, skipped=57, lr=[7.806512574294239e-07, 7.806512574294239e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:18,358] [INFO] [timer.py:199:stop] epoch=1/micro_step=3150/global_step=6830, RunningAvgSamplesPerSec=203.28979527081148, CurrSamplesPerSec=205.3744191135179, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:19,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=6840, skipped=57, lr=[7.544132723562108e-07, 7.544132723562108e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:19,914] [INFO] [timer.py:199:stop] epoch=1/micro_step=3160/global_step=6840, RunningAvgSamplesPerSec=203.29443708010083, CurrSamplesPerSec=205.8528852406719, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:21,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=6850, skipped=57, lr=[7.286170368734496e-07, 7.286170368734496e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:21,468] [INFO] [timer.py:199:stop] epoch=1/micro_step=3170/global_step=6850, RunningAvgSamplesPerSec=203.29935436934846, CurrSamplesPerSec=206.93706983743246, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:23,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=6860, skipped=57, lr=[7.03263020983233e-07, 7.03263020983233e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:23,020] [INFO] [timer.py:199:stop] epoch=1/micro_step=3180/global_step=6860, RunningAvgSamplesPerSec=203.3045919113007, CurrSamplesPerSec=207.23677766214675, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:24,564] [INFO] [logging.py:96:log_dist] [Rank 0] step=6870, skipped=57, lr=[6.783516866304989e-07, 6.783516866304989e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:24,571] [INFO] [timer.py:199:stop] epoch=1/micro_step=3190/global_step=6870, RunningAvgSamplesPerSec=203.30996953066898, CurrSamplesPerSec=207.6988980413579, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:26,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=6880, skipped=57, lr=[6.538834876946232e-07, 6.538834876946232e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:26,124] [INFO] [timer.py:199:stop] epoch=1/micro_step=3200/global_step=6880, RunningAvgSamplesPerSec=203.31493626962856, CurrSamplesPerSec=207.0392181059802, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:27,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=6890, skipped=57, lr=[6.298588699811481e-07, 6.298588699811481e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:27,677] [INFO] [timer.py:199:stop] epoch=1/micro_step=3210/global_step=6890, RunningAvgSamplesPerSec=203.31998367009857, CurrSamplesPerSec=206.23023132533822, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:28,435] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:08:28,435] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 23:08:29,237] [INFO] [logging.py:96:log_dist] [Rank 0] step=6900, skipped=57, lr=[6.062782712136506e-07, 6.062782712136506e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:29,243] [INFO] [timer.py:199:stop] epoch=1/micro_step=3220/global_step=6900, RunningAvgSamplesPerSec=203.32249508485145, CurrSamplesPerSec=204.60861651317967, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:30,807] [INFO] [logging.py:96:log_dist] [Rank 0] step=6910, skipped=57, lr=[5.831421210257787e-07, 5.831421210257787e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:30,813] [INFO] [timer.py:199:stop] epoch=1/micro_step=3230/global_step=6910, RunningAvgSamplesPerSec=203.32442446168133, CurrSamplesPerSec=190.68133004681164, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:32,367] [INFO] [logging.py:96:log_dist] [Rank 0] step=6920, skipped=57, lr=[5.604508409534165e-07, 5.604508409534165e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:32,374] [INFO] [timer.py:199:stop] epoch=1/micro_step=3240/global_step=6920, RunningAvgSamplesPerSec=203.32802914701998, CurrSamplesPerSec=206.74867949407178, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:33,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=6930, skipped=57, lr=[5.382048444270094e-07, 5.382048444270094e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:33,929] [INFO] [timer.py:199:stop] epoch=1/micro_step=3250/global_step=6930, RunningAvgSamplesPerSec=203.33257642987564, CurrSamplesPerSec=206.3484953308673, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:35,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=6940, skipped=57, lr=[5.164045367640258e-07, 5.164045367640258e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:35,486] [INFO] [timer.py:199:stop] epoch=1/micro_step=3260/global_step=6940, RunningAvgSamplesPerSec=203.33673322466717, CurrSamplesPerSec=206.1630746497457, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:37,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=6950, skipped=57, lr=[4.950503151615743e-07, 4.950503151615743e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:37,047] [INFO] [timer.py:199:stop] epoch=1/micro_step=3270/global_step=6950, RunningAvgSamplesPerSec=203.34014659847665, CurrSamplesPerSec=206.16845824180888, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:38,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=6960, skipped=57, lr=[4.741425686891732e-07, 4.741425686891732e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:38,603] [INFO] [timer.py:199:stop] epoch=1/micro_step=3280/global_step=6960, RunningAvgSamplesPerSec=203.34445726900316, CurrSamplesPerSec=206.29996849038187, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:40,174] [INFO] [logging.py:96:log_dist] [Rank 0] step=6970, skipped=57, lr=[4.5368167828165055e-07, 4.5368167828165055e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:40,180] [INFO] [timer.py:199:stop] epoch=1/micro_step=3290/global_step=6970, RunningAvgSamplesPerSec=203.34489744099, CurrSamplesPerSec=206.2508305800999, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:41,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=6980, skipped=57, lr=[4.3366801673220545e-07, 4.3366801673220545e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:41,759] [INFO] [timer.py:199:stop] epoch=1/micro_step=3300/global_step=6980, RunningAvgSamplesPerSec=203.34512889412852, CurrSamplesPerSec=206.37990721819102, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
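
The lr=[...] values trace the cosine decay requested by --lr_scheduler_type cosine with --num_warmup_steps 0: the learning rate falls from its 5e-5 peak toward zero as training approaches its final step. A sketch of the standard schedule follows; the total step count below is an assumption for illustration (roughly two epochs of ~3681 steps), so the result only matches the logged values to the same order of magnitude:

    import math

    def cosine_lr(step, total_steps=7362, peak_lr=5e-5):
        # total_steps is assumed; peak_lr is the run's --learning_rate.
        # Standard cosine decay with zero warmup.
        progress = step / total_steps
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

    print(cosine_lr(6910))  # ~4.6e-07, same order as the 5.83e-07 logged at step 6910
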
[2023-05-15 23:08:43,124] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6988
[2023-05-15 23:08:43,124] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 23:08:43,124] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 23:08:43,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=6990, skipped=58, lr=[4.1603840355876057e-07, 4.1603840355876057e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:43,281] [INFO] [timer.py:199:stop] epoch=1/micro_step=3310/global_step=6990, RunningAvgSamplesPerSec=203.3556553333026, CurrSamplesPerSec=206.29013903532905, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:44,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=7000, skipped=58, lr=[3.968754746960346e-07, 3.968754746960346e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:44,839] [INFO] [timer.py:199:stop] epoch=1/micro_step=3320/global_step=7000, RunningAvgSamplesPerSec=203.35971441047016, CurrSamplesPerSec=205.950805434419, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:46,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=7010, skipped=58, lr=[3.781608096887046e-07, 3.781608096887046e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:46,436] [INFO] [timer.py:199:stop] epoch=1/micro_step=3330/global_step=7010, RunningAvgSamplesPerSec=203.35650661665017, CurrSamplesPerSec=200.34559977072266, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:48,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=7020, skipped=58, lr=[3.598947495141114e-07, 3.598947495141114e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:48,035] [INFO] [timer.py:199:stop] epoch=1/micro_step=3340/global_step=7020, RunningAvgSamplesPerSec=203.3529128647905, CurrSamplesPerSec=200.83875466491892, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:49,639] [INFO] [logging.py:96:log_dist] [Rank 0] step=7030, skipped=58, lr=[3.4207762697610057e-07, 3.4207762697610057e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:49,646] [INFO] [timer.py:199:stop] epoch=1/micro_step=3350/global_step=7030, RunningAvgSamplesPerSec=203.34734852165258, CurrSamplesPerSec=206.1510417546635, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:51,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=7040, skipped=58, lr=[3.2470976669896905e-07, 3.2470976669896905e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:51,202] [INFO] [timer.py:199:stop] epoch=1/micro_step=3360/global_step=7040, RunningAvgSamplesPerSec=203.35159111134342, CurrSamplesPerSec=206.80665823319225, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:52,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=7050, skipped=58, lr=[3.077914851215585e-07, 3.077914851215585e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:52,763] [INFO] [timer.py:199:stop] epoch=1/micro_step=3370/global_step=7050, RunningAvgSamplesPerSec=203.3550217128685, CurrSamplesPerSec=207.48730506189787, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:54,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=7060, skipped=58, lr=[2.9132309049146046e-07, 2.9132309049146046e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:54,315] [INFO] [timer.py:199:stop] epoch=1/micro_step=3380/global_step=7060, RunningAvgSamplesPerSec=203.36002302770763, CurrSamplesPerSec=207.08745434880214, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:55,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=7070, skipped=58, lr=[2.75304882859434e-07, 2.75304882859434e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:55,910] [INFO] [timer.py:199:stop] epoch=1/micro_step=3390/global_step=7070, RunningAvgSamplesPerSec=203.35707183761897, CurrSamplesPerSec=198.2724038904712, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:57,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=7080, skipped=58, lr=[2.5973715407391905e-07, 2.5973715407391905e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:57,533] [INFO] [timer.py:199:stop] epoch=1/micro_step=3400/global_step=7080, RunningAvgSamplesPerSec=203.34904880544028, CurrSamplesPerSec=198.2539531077594, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:08:59,141] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:08:59,141] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 23:08:59,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=7090, skipped=58, lr=[2.4462018777572095e-07, 2.4462018777572095e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:08:59,160] [INFO] [timer.py:199:stop] epoch=1/micro_step=3410/global_step=7090, RunningAvgSamplesPerSec=203.3405264914874, CurrSamplesPerSec=197.39120357433842, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:00,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=7100, skipped=58, lr=[2.2995425939285053e-07, 2.2995425939285053e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:00,784] [INFO] [timer.py:199:stop] epoch=1/micro_step=3420/global_step=7100, RunningAvgSamplesPerSec=203.33240893559915, CurrSamplesPerSec=197.75561989376828, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:02,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=7110, skipped=58, lr=[2.1573963613549796e-07, 2.1573963613549796e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:02,408] [INFO] [timer.py:199:stop] epoch=1/micro_step=3430/global_step=7110, RunningAvgSamplesPerSec=203.3243241390103, CurrSamplesPerSec=204.4755149299208, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:03,997] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 7119
[2023-05-15 23:09:03,997] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-05-15 23:09:03,997] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-05-15 23:09:03,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=7120, skipped=59, lr=[2.0333255530934902e-07, 2.0333255530934902e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:03,998] [INFO] [timer.py:199:stop] epoch=1/micro_step=3440/global_step=7120, RunningAvgSamplesPerSec=203.32235995314176, CurrSamplesPerSec=249.95526354505958, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:05,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=7130, skipped=59, lr=[1.8997611850120333e-07, 1.8997611850120333e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:05,616] [INFO] [timer.py:199:stop] epoch=1/micro_step=3450/global_step=7130, RunningAvgSamplesPerSec=203.31552021492288, CurrSamplesPerSec=198.02375381019968, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:07,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=7140, skipped=59, lr=[1.7707171521205712e-07, 1.7707171521205712e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:07,242] [INFO] [timer.py:199:stop] epoch=1/micro_step=3460/global_step=7140, RunningAvgSamplesPerSec=203.3071122643779, CurrSamplesPerSec=195.83793997528275, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:08,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=7150, skipped=59, lr=[1.6461958055747906e-07, 1.6461958055747906e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:08,874] [INFO] [timer.py:199:stop] epoch=1/micro_step=3470/global_step=7150, RunningAvgSamplesPerSec=203.29782471889547, CurrSamplesPerSec=192.36038587384144, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:10,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=7160, skipped=59, lr=[1.52619941412796e-07, 1.52619941412796e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:10,502] [INFO] [timer.py:199:stop] epoch=1/micro_step=3480/global_step=7160, RunningAvgSamplesPerSec=203.28908084326932, CurrSamplesPerSec=198.19598728287338, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:12,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=7170, skipped=59, lr=[1.4107301640895464e-07, 1.4107301640895464e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:12,127] [INFO] [timer.py:199:stop] epoch=1/micro_step=3490/global_step=7170, RunningAvgSamplesPerSec=203.28101459888828, CurrSamplesPerSec=198.2442897824623, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:13,728] [INFO] [logging.py:96:log_dist] [Rank 0] step=7180, skipped=59, lr=[1.2997901592855521e-07, 1.2997901592855521e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:13,734] [INFO] [timer.py:199:stop] epoch=1/micro_step=3500/global_step=7180, RunningAvgSamplesPerSec=203.27626955588477, CurrSamplesPerSec=201.04874263767474, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:15,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=7190, skipped=59, lr=[1.1933814210199067e-07, 1.1933814210199067e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:15,324] [INFO] [timer.py:199:stop] epoch=1/micro_step=3510/global_step=7190, RunningAvgSamplesPerSec=203.27458327578108, CurrSamplesPerSec=202.27038561929965, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:16,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=7200, skipped=59, lr=[1.0915058880379137e-07, 1.0915058880379137e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:16,913] [INFO] [timer.py:199:stop] epoch=1/micro_step=3520/global_step=7200, RunningAvgSamplesPerSec=203.2730193687464, CurrSamplesPerSec=202.50965339513436, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:18,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=7210, skipped=59, lr=[9.941654164907776e-08, 9.941654164907776e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:18,507] [INFO] [timer.py:199:stop] epoch=1/micro_step=3530/global_step=7210, RunningAvgSamplesPerSec=203.27050308596168, CurrSamplesPerSec=201.73924963512866, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:20,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=7220, skipped=59, lr=[9.013617799018271e-08, 9.013617799018271e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:20,098] [INFO] [timer.py:199:stop] epoch=1/micro_step=3540/global_step=7220, RunningAvgSamplesPerSec=203.26866974477542, CurrSamplesPerSec=206.97983215566776, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:20,234] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:09:20,234] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-05-15 23:09:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=7230, skipped=59, lr=[8.130966691341502e-08, 8.130966691341502e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:21,652] [INFO] [timer.py:199:stop] epoch=1/micro_step=3550/global_step=7230, RunningAvgSamplesPerSec=203.27317556863946, CurrSamplesPerSec=205.68537359452844, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:23,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=7240, skipped=59, lr=[7.293716923598704e-08, 7.293716923598704e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:23,207] [INFO] [timer.py:199:stop] epoch=1/micro_step=3560/global_step=7240, RunningAvgSamplesPerSec=203.2778737791596, CurrSamplesPerSec=207.18495375230003, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:24,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=7250, skipped=59, lr=[6.501883750308357e-08, 6.501883750308357e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:24,759] [INFO] [timer.py:199:stop] epoch=1/micro_step=3570/global_step=7250, RunningAvgSamplesPerSec=203.28316860192436, CurrSamplesPerSec=207.2985523452378, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:26,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=7260, skipped=59, lr=[5.755481598507528e-08, 5.755481598507528e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:26,312] [INFO] [timer.py:199:stop] epoch=1/micro_step=3580/global_step=7260, RunningAvgSamplesPerSec=203.28812084256649, CurrSamplesPerSec=206.76524153060015, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:27,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=7270, skipped=59, lr=[5.0545240674890194e-08, 5.0545240674890194e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:27,871] [INFO] [timer.py:199:stop] epoch=1/micro_step=3590/global_step=7270, RunningAvgSamplesPerSec=203.29168627529126, CurrSamplesPerSec=208.99547184383778, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:29,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=7280, skipped=59, lr=[4.399023928554902e-08, 4.399023928554902e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:29,430] [INFO] [timer.py:199:stop] epoch=1/micro_step=3600/global_step=7280, RunningAvgSamplesPerSec=203.29547689212, CurrSamplesPerSec=204.885660814, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:30,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=7290, skipped=59, lr=[3.788993124782536e-08, 3.788993124782536e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:30,985] [INFO] [timer.py:199:stop] epoch=1/micro_step=3610/global_step=7290, RunningAvgSamplesPerSec=203.29986594001602, CurrSamplesPerSec=207.1350737377541, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:32,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=7300, skipped=59, lr=[3.2244427708072455e-08, 3.2244427708072455e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:32,537] [INFO] [timer.py:199:stop] epoch=1/micro_step=3620/global_step=7300, RunningAvgSamplesPerSec=203.30473921493268, CurrSamplesPerSec=206.70473925839454, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:34,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=7310, skipped=59, lr=[2.705383152619978e-08, 2.705383152619978e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:34,112] [INFO] [timer.py:199:stop] epoch=1/micro_step=3630/global_step=7310, RunningAvgSamplesPerSec=203.30559933328414, CurrSamplesPerSec=207.33217580434876, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:35,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=7320, skipped=59, lr=[2.2318237273802333e-08, 2.2318237273802333e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:35,683] [INFO] [timer.py:199:stop] epoch=1/micro_step=3640/global_step=7320, RunningAvgSamplesPerSec=203.30710804017355, CurrSamplesPerSec=207.56977181166536, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:35,820] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-05-15 23:09:35,820] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-05-15 23:09:36,114] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 7322
[2023-05-15 23:09:36,114] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-05-15 23:09:36,114] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-05-15 23:09:37,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=7330, skipped=60, lr=[1.8445300579364445e-08, 1.8445300579364445e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:37,202] [INFO] [timer.py:199:stop] epoch=1/micro_step=3650/global_step=7330, RunningAvgSamplesPerSec=203.31780659411254, CurrSamplesPerSec=207.15937125711534, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:38,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=7340, skipped=60, lr=[1.4574440845649407e-08, 1.4574440845649407e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:38,766] [INFO] [timer.py:199:stop] epoch=1/micro_step=3660/global_step=7340, RunningAvgSamplesPerSec=203.32057021501436, CurrSamplesPerSec=207.78024645488884, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:40,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=7350, skipped=60, lr=[1.1158810413333665e-08, 1.1158810413333665e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:40,337] [INFO] [timer.py:199:stop] epoch=1/micro_step=3670/global_step=7350, RunningAvgSamplesPerSec=203.32226818683523, CurrSamplesPerSec=206.79136398022337, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
[2023-05-15 23:09:41,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=7360, skipped=60, lr=[8.198471514497819e-09, 8.198471514497819e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-15 23:09:41,895] [INFO] [timer.py:199:stop] epoch=1/micro_step=3680/global_step=7360, RunningAvgSamplesPerSec=203.32598593787492, CurrSamplesPerSec=206.83470357351447, MemAllocated=4.34GB, MaxMemAllocated=12.81GB
Epoch 2/2 with loss 0.4578983452009118
***** Evaluating reward, Epoch 2/2 *****
chosen_last_scores (higher is better) : 6.453372001647949, acc (higher is better) : 0.7074999809265137
saving model ...
[2023-05-15 23:09:58,706] [INFO] [launch.py:460:main] Process 58137 exits successfully.
[2023-05-15 23:09:59,706] [INFO] [launch.py:460:main] Process 58143 exits successfully.
[2023-05-15 23:09:59,711] [INFO] [launch.py:460:main] Process 58138 exits successfully.
[2023-05-15 23:09:59,712] [INFO] [launch.py:460:main] Process 58139 exits successfully.
[2023-05-15 23:09:59,712] [INFO] [launch.py:460:main] Process 58142 exits successfully.
[2023-05-15 23:09:59,713] [INFO] [launch.py:460:main] Process 58141 exits successfully.
[2023-05-15 23:10:00,713] [INFO] [launch.py:460:main] Process 58140 exits successfully.
[2023-05-15 23:10:00,714] [INFO] [launch.py:460:main] Process 58136 exits successfully.
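
For reference, the closing evaluation reports two reward-model metrics: chosen_last_scores, the average score the model assigns to the preferred responses, and acc, the fraction of preference pairs in which the chosen response outscores the rejected one (about 0.707 here). A hedged sketch of that pairwise accuracy, with illustrative tensor names rather than the training script's actual variables:

    import torch

    def pairwise_accuracy(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> float:
        # Fraction of pairs where the reward model prefers the chosen response.
        return (chosen_scores > rejected_scores).float().mean().item()

    chosen = torch.tensor([6.45, 5.10, 7.02, 4.88])    # toy scores, not from the run
    rejected = torch.tensor([5.90, 5.30, 6.10, 4.20])
    print(pairwise_accuracy(chosen, rejected))          # 0.75 on this toy batch
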