[2023-06-29 16:59:28,411] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:29,413] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-06-29 16:59:29,469] [INFO] [runner.py:555:main] cmd = /home/mxfeng/miniconda3/envs/safe-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir ./output [2023-06-29 16:59:30,743] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:31,756] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2023-06-29 16:59:31,756] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0 [2023-06-29 16:59:31,756] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2023-06-29 16:59:31,756] [INFO] [launch.py:163:main] dist_world_size=8 [2023-06-29 16:59:31,757] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2023-06-29 16:59:33,441] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,500] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,518] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) [2023-06-29 16:59:33,522] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,523] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,523] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,524] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:33,524] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-06-29 16:59:35,870] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,870] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,870] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2023-06-29 16:59:35,894] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,894] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,935] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,935] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,956] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,956] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,969] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,970] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,986] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,986] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,988] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented 
[2023-06-29 16:59:35,988] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-29 16:59:35,990] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-29 16:59:35,991] [INFO] [comm.py:594:init_distributed] cdb=None Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.07970833778381348 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.10200619697570801 seconds Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... 
Time to load fused_adam op: 0.12644672393798828 seconds Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.14471650123596191 seconds Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.1467268466949463 seconds [2023-06-29 17:00:26,305] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown [2023-06-29 17:00:26,306] [INFO] [comm.py:619:init_distributed] Distributed backend already initialized Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... 
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.1389760971069336 seconds Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.14158344268798828 seconds Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /home/mxfeng/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/mxfeng/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... 
Time to load fused_adam op: 0.15918207168579102 seconds [2023-06-29 17:00:30,426] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-06-29 17:00:30,428] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer [2023-06-29 17:00:30,428] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2023-06-29 17:00:30,446] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2023-06-29 17:00:30,446] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2023-06-29 17:00:30,446] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2023-06-29 17:00:30,447] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000 [2023-06-29 17:00:30,447] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000 [2023-06-29 17:00:30,447] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False [2023-06-29 17:00:30,447] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False Rank: 0 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 3 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 2 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 7 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 4 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 5 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 6 partition count [8, 8] and sizes[(164401920, False), (67840, False)] Rank: 1 partition count [8, 8] and sizes[(164401920, False), (67840, False)] [2023-06-29 17:00:36,881] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states [2023-06-29 17:00:36,882] [INFO] [utils.py:786:see_memory_usage] MA 3.06 GB Max_MA 3.06 GB CA 3.07 GB Max_CA 3 GB [2023-06-29 17:00:36,882] [INFO] 
[utils.py:793:see_memory_usage] CPU Virtual Memory: used = 78.85 GB, percent = 15.7% [2023-06-29 17:00:37,270] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states [2023-06-29 17:00:37,271] [INFO] [utils.py:786:see_memory_usage] MA 4.29 GB Max_MA 4.91 GB CA 4.91 GB Max_CA 5 GB [2023-06-29 17:00:37,272] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 78.86 GB, percent = 15.7% [2023-06-29 17:00:37,272] [INFO] [stage_1_and_2.py:488:__init__] optimizer state initialized [2023-06-29 17:00:37,663] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer [2023-06-29 17:00:37,664] [INFO] [utils.py:786:see_memory_usage] MA 4.29 GB Max_MA 4.29 GB CA 4.91 GB Max_CA 5 GB [2023-06-29 17:00:37,664] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 78.86 GB, percent = 15.7% [2023-06-29 17:00:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [2023-06-29 17:00:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-06-29 17:00:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2023-06-29 17:00:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:00:37,668] [INFO] [config.py:960:print] DeepSpeedEngine configuration: [2023-06-29 17:00:37,668] [INFO] [config.py:964:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-06-29 17:00:37,668] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-06-29 17:00:37,668] [INFO] [config.py:964:print] amp_enabled .................. 
False [2023-06-29 17:00:37,668] [INFO] [config.py:964:print] amp_params ................... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] bfloat16_enabled ............. False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] comms_config ................. [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] communication_data_type ...... None [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] curriculum_params_legacy ..... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] data_efficiency_enabled ...... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] dataloader_drop_last ......... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] disable_allgather ............ False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] dump_state ................... 
False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1} [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] eigenvalue_enabled ........... False [2023-06-29 17:00:37,669] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] eigenvalue_verbose ........... False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] elasticity_enabled ........... False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] fp16_auto_cast ............... False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] fp16_enabled ................. True [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] global_rank .................. 0 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] grad_accum_dtype ............. None [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] gradient_clipping ............ 1.0 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] gradient_predivide_factor .... 
1.0 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] initial_dynamic_scale ........ 65536 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] load_universal_checkpoint .... False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] loss_scale ................... 0 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] memory_breakdown ............. False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] mics_hierarchial_params_gather False [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] mics_shard_size .............. -1 [2023-06-29 17:00:37,670] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] optimizer_name ............... None [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] optimizer_params ............. None [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] pld_enabled .................. 
False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] pld_params ................... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] prescale_gradients ........... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] scheduler_name ............... None [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] scheduler_params ............. None [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] sparse_attention ............. None [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] steps_per_print .............. 10 [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] train_batch_size ............. 64 [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 8 [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] use_node_local_storage ....... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] wall_clock_breakdown ......... False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] world_size ................... 8 [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] zero_allow_untested_optimizer False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] zero_config .................. 
stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] zero_enabled ................. True [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True [2023-06-29 17:00:37,671] [INFO] [config.py:964:print] zero_optimization_stage ...... 
2 [2023-06-29 17:00:37,671] [INFO] [config.py:950:print_user_config] json = { "train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "steps_per_print": 10, "zero_optimization": { "stage": 2, "offload_param": { "device": "none" }, "offload_optimizer": { "device": "none" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07, "memory_efficient_linear": false }, "fp16": { "enabled": true, "loss_scale_window": 100 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "hybrid_engine": { "enabled": false, "max_out_tokens": 512, "inference_tp_size": 1, "release_inference_cache": false, "pin_parameters": true, "tp_gather_partition_size": 8 } } ***** Running training ***** ***** Evaluating perplexity, Epoch 0/16 ***** ppl: 4937.3388671875 Beginning of Epoch 1/16, Total Micro Batches 920 [2023-06-29 17:00:56,813] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:00:57,505] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:00:58,211] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384 [2023-06-29 17:00:58,922] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192 [2023-06-29 17:00:59,633] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096 [2023-06-29 17:01:01,864] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 4096, reducing to 2048 [2023-06-29 17:01:03,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=6, lr=[9.649998241787337e-06, 9.649998241787337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:03,371] [INFO] [timer.py:215:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=87.36301184232441, CurrSamplesPerSec=85.00568455846425, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:10,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=6, lr=[9.649978461909591e-06, 9.649978461909591e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:10,904] [INFO] [timer.py:215:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=86.0546405067235, CurrSamplesPerSec=85.1797474138478, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:18,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=6, lr=[9.649936704478667e-06, 9.649936704478667e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:18,436] [INFO] [timer.py:215:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=85.68645474055734, CurrSamplesPerSec=84.50329736541306, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:25,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=6, lr=[9.649872969684765e-06, 9.649872969684765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:25,969] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=85.5153506698691, CurrSamplesPerSec=85.17185562548624, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:33,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=6, lr=[9.649787257818198e-06, 9.649787257818198e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:33,486] [INFO] [timer.py:215:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=85.45203017762917, CurrSamplesPerSec=85.58679663258526, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:40,949] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=60, skipped=6, lr=[9.649679569269376e-06, 9.649679569269376e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:40,982] [INFO] [timer.py:215:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=85.45094509115019, CurrSamplesPerSec=85.36359036867175, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:48,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=6, lr=[9.649549904528819e-06, 9.649549904528819e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:48,480] [INFO] [timer.py:215:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=85.44715075192755, CurrSamplesPerSec=85.34535216999413, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:01:55,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=6, lr=[9.649398264187143e-06, 9.649398264187143e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:01:55,975] [INFO] [timer.py:215:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=85.4500661241271, CurrSamplesPerSec=85.22756182309553, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:03,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=6, lr=[9.64922464893506e-06, 9.64922464893506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:03,489] [INFO] [timer.py:215:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=85.4263714759143, CurrSamplesPerSec=85.54697451011386, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:10,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=6, lr=[9.649029059563382e-06, 9.649029059563382e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:11,003] [INFO] [timer.py:215:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=85.40901035131566, CurrSamplesPerSec=84.86006135419275, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:18,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=6, 
lr=[9.648811496963009e-06, 9.648811496963009e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:18,524] [INFO] [timer.py:215:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=85.38598035952431, CurrSamplesPerSec=84.75240852590069, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:26,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=6, lr=[9.64857196212493e-06, 9.64857196212493e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:26,047] [INFO] [timer.py:215:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=85.36547958640068, CurrSamplesPerSec=85.19710368108245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=6, lr=[9.648310456140211e-06, 9.648310456140211e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:33,567] [INFO] [timer.py:215:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=85.35045395000549, CurrSamplesPerSec=85.41620453397042, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:41,060] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=6, lr=[9.648026980200002e-06, 9.648026980200002e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:41,093] [INFO] [timer.py:215:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=85.33289124845325, CurrSamplesPerSec=85.36956290132945, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:48,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=6, lr=[9.647721535595524e-06, 9.647721535595524e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:02:48,605] [INFO] [timer.py:215:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=85.32902993034078, CurrSamplesPerSec=85.11365833535996, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:02:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=6, lr=[9.647394123718063e-06, 9.647394123718063e-06], mom=[(0.9, 
0.95), (0.9, 0.95)] [2023-06-29 17:02:56,127] [INFO] [timer.py:215:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=85.31740141120706, CurrSamplesPerSec=85.13865586863737, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:03,621] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=6, lr=[9.647044746058962e-06, 9.647044746058962e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:03,655] [INFO] [timer.py:215:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=85.30413445328544, CurrSamplesPerSec=85.09992404154762, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:11,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=6, lr=[9.646673404209623e-06, 9.646673404209623e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:11,174] [INFO] [timer.py:215:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=85.29783463745754, CurrSamplesPerSec=85.26776418000811, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:18,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=6, lr=[9.64628009986149e-06, 9.64628009986149e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:18,695] [INFO] [timer.py:215:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=85.29031716307757, CurrSamplesPerSec=84.98937488186486, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:26,189] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=6, lr=[9.645864834806044e-06, 9.645864834806044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:26,223] [INFO] [timer.py:215:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=85.28038755175349, CurrSamplesPerSec=85.04686337651015, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:33,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=6, lr=[9.6454276109348e-06, 9.6454276109348e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:33,749] [INFO] 
[timer.py:215:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=85.2719310561742, CurrSamplesPerSec=85.21825436130307, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:41,239] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=6, lr=[9.644968430239294e-06, 9.644968430239294e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:41,272] [INFO] [timer.py:215:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=85.26586888884698, CurrSamplesPerSec=85.36923710527111, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:48,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=6, lr=[9.644487294811071e-06, 9.644487294811071e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:48,797] [INFO] [timer.py:215:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=85.25941376967288, CurrSamplesPerSec=85.07319670388074, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:03:56,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=6, lr=[9.643984206841679e-06, 9.643984206841679e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:03:56,328] [INFO] [timer.py:215:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=85.25069092140829, CurrSamplesPerSec=84.78490598456891, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:03,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=6, lr=[9.643459168622665e-06, 9.643459168622665e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:03,873] [INFO] [timer.py:215:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=85.23608337190929, CurrSamplesPerSec=84.95247245817893, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:11,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=6, lr=[9.64291218254555e-06, 9.64291218254555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:11,398] [INFO] [timer.py:215:stop] epoch=0/micro_step=260/global_step=260, 
RunningAvgSamplesPerSec=85.23161286924586, CurrSamplesPerSec=85.18810024845364, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:18,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=6, lr=[9.64234325110183e-06, 9.64234325110183e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:18,932] [INFO] [timer.py:215:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=85.22374135026561, CurrSamplesPerSec=84.97912398222132, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:26,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=6, lr=[9.641752376882963e-06, 9.641752376882963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:26,454] [INFO] [timer.py:215:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=85.22282530586891, CurrSamplesPerSec=84.79958351339808, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:33,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=6, lr=[9.64113956258035e-06, 9.64113956258035e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:33,985] [INFO] [timer.py:215:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=85.21674229742284, CurrSamplesPerSec=84.16553588085439, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:41,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=6, lr=[9.640504810985339e-06, 9.640504810985339e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:41,511] [INFO] [timer.py:215:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=85.21307508802276, CurrSamplesPerSec=85.4199282809219, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:49,003] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=6, lr=[9.639848124989188e-06, 9.639848124989188e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:49,036] [INFO] [timer.py:215:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=85.20993477999993, 
CurrSamplesPerSec=84.71544419870533, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:04:56,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=6, lr=[9.639169507583073e-06, 9.639169507583073e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:04:56,576] [INFO] [timer.py:215:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=85.20194057883123, CurrSamplesPerSec=85.14232844929165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:04,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=6, lr=[9.638468961858065e-06, 9.638468961858065e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:04,119] [INFO] [timer.py:215:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=85.19284390956219, CurrSamplesPerSec=85.19867204113372, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:11,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=6, lr=[9.637746491005118e-06, 9.637746491005118e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:11,657] [INFO] [timer.py:215:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=85.18661009890147, CurrSamplesPerSec=85.1956705749726, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:19,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=6, lr=[9.637002098315053e-06, 9.637002098315053e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:19,185] [INFO] [timer.py:215:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=85.1838857973827, CurrSamplesPerSec=84.99884772745192, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:26,676] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=6, lr=[9.636235787178543e-06, 9.636235787178543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:26,709] [INFO] [timer.py:215:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=85.18244981324582, CurrSamplesPerSec=84.64198583474543, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 17:05:34,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=6, lr=[9.635447561086101e-06, 9.635447561086101e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:34,241] [INFO] [timer.py:215:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=85.17871840789455, CurrSamplesPerSec=85.09803558555502, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:41,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=6, lr=[9.634637423628059e-06, 9.634637423628059e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:41,788] [INFO] [timer.py:215:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=85.17075772917457, CurrSamplesPerSec=84.8219309671014, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:49,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=6, lr=[9.633805378494556e-06, 9.633805378494556e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:49,326] [INFO] [timer.py:215:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=85.16555966861414, CurrSamplesPerSec=84.77400826090019, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:05:56,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=6, lr=[9.632951429475518e-06, 9.632951429475518e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:05:56,865] [INFO] [timer.py:215:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=85.16082671166932, CurrSamplesPerSec=84.65746826868069, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:04,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=6, lr=[9.632075580460647e-06, 9.632075580460647e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:04,402] [INFO] [timer.py:215:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=85.15645986552776, CurrSamplesPerSec=84.81327462064185, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:11,913] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=420, skipped=6, lr=[9.631177835439391e-06, 9.631177835439391e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:11,946] [INFO] [timer.py:215:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=85.1504787329919, CurrSamplesPerSec=85.02078870120344, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:19,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=6, lr=[9.630258198500938e-06, 9.630258198500938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:19,488] [INFO] [timer.py:215:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=85.14534641355554, CurrSamplesPerSec=84.99279239801263, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=6, lr=[9.629316673834193e-06, 9.629316673834193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:27,031] [INFO] [timer.py:215:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=85.14033638269318, CurrSamplesPerSec=84.79411902207768, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:34,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=6, lr=[9.628353265727755e-06, 9.628353265727755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:34,557] [INFO] [timer.py:215:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=85.13945499926778, CurrSamplesPerSec=85.03438971333827, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:42,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=6, lr=[9.627367978569902e-06, 9.627367978569902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:42,103] [INFO] [timer.py:215:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=85.13391307878791, CurrSamplesPerSec=84.96231355625082, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:49,611] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=6, 
lr=[9.626360816848576e-06, 9.626360816848576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:49,645] [INFO] [timer.py:215:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=85.12966345758294, CurrSamplesPerSec=85.00180842477052, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:06:57,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=6, lr=[9.625331785151348e-06, 9.625331785151348e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:06:57,185] [INFO] [timer.py:215:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=85.12590508216977, CurrSamplesPerSec=84.7359016536786, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:04,697] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=6, lr=[9.624280888165412e-06, 9.624280888165412e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:04,731] [INFO] [timer.py:215:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=85.12090336966642, CurrSamplesPerSec=84.54151451046567, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:12,244] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=6, lr=[9.623208130677554e-06, 9.623208130677554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:12,277] [INFO] [timer.py:215:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=85.11595361411979, CurrSamplesPerSec=84.72648732316392, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:18,991] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:07:19,687] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:07:19,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=8, lr=[9.622334188406173e-06, 9.622334188406173e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:19,689] [INFO] [timer.py:215:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=85.14129699241064, CurrSamplesPerSec=91.92016055824192, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:27,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=8, lr=[9.621222094395383e-06, 9.621222094395383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:27,223] [INFO] [timer.py:215:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=85.13883034658456, CurrSamplesPerSec=84.90362339399472, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:34,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=8, lr=[9.620088153815335e-06, 9.620088153815335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:34,768] [INFO] [timer.py:215:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=85.13442717768643, CurrSamplesPerSec=84.88410486292585, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:42,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=8, lr=[9.618932371831077e-06, 9.618932371831077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:42,309] [INFO] [timer.py:215:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=85.13069591695952, CurrSamplesPerSec=85.0811511189039, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:07:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=8, lr=[9.61775475370714e-06, 9.61775475370714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:49,847] [INFO] [timer.py:215:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=85.12785368152765, CurrSamplesPerSec=85.0392925310579, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:07:57,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=8, lr=[9.61655530480752e-06, 9.61655530480752e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:07:57,389] [INFO] [timer.py:215:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=85.12415259122345, CurrSamplesPerSec=85.05440862041213, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:04,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=8, lr=[9.615334030595654e-06, 9.615334030595654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:04,936] [INFO] [timer.py:215:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=85.1196897004734, CurrSamplesPerSec=84.99343825871321, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:12,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=8, lr=[9.614090936634385e-06, 9.614090936634385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:12,479] [INFO] [timer.py:215:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=85.11629322232528, CurrSamplesPerSec=84.82469172161585, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=8, lr=[9.612826028585952e-06, 9.612826028585952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:20,033] [INFO] [timer.py:215:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=85.11096222005614, CurrSamplesPerSec=84.87607990071696, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:27,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=8, lr=[9.611539312211953e-06, 9.611539312211953e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:27,584] [INFO] [timer.py:215:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=85.10601174726905, CurrSamplesPerSec=85.08128595231703, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:35,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, 
skipped=8, lr=[9.610230793373317e-06, 9.610230793373317e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:35,128] [INFO] [timer.py:215:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=85.10276056586702, CurrSamplesPerSec=85.11087873935858, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:35,825] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:08:36,524] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:08:42,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=10, lr=[9.60916828452982e-06, 9.60916828452982e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:42,558] [INFO] [timer.py:215:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=85.12059554109679, CurrSamplesPerSec=85.08932279556947, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:50,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=10, lr=[9.607820536341373e-06, 9.607820536341373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:50,112] [INFO] [timer.py:215:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=85.11530364401824, CurrSamplesPerSec=84.95551058670425, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:08:57,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=10, lr=[9.606451002627145e-06, 9.606451002627145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:08:57,656] [INFO] [timer.py:215:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=85.11194370661472, CurrSamplesPerSec=84.97872045408859, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:05,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=10, lr=[9.605059689625296e-06, 9.605059689625296e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
17:09:05,208] [INFO] [timer.py:215:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=85.107325382068, CurrSamplesPerSec=85.11562845023137, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:12,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=10, lr=[9.603646603673193e-06, 9.603646603673193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:12,755] [INFO] [timer.py:215:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=85.1038351667805, CurrSamplesPerSec=84.84364660651275, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:20,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=10, lr=[9.60221175120738e-06, 9.60221175120738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:20,301] [INFO] [timer.py:215:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=85.10058825195995, CurrSamplesPerSec=84.73156866217245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:27,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=10, lr=[9.600755138763538e-06, 9.600755138763538e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:27,838] [INFO] [timer.py:215:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=85.09869307387034, CurrSamplesPerSec=84.95440823176705, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:35,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=10, lr=[9.599276772976471e-06, 9.599276772976471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:35,380] [INFO] [timer.py:215:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=85.09617901230331, CurrSamplesPerSec=84.91350689799484, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:42,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=10, lr=[9.59777666058007e-06, 9.59777666058007e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:42,927] [INFO] [timer.py:215:stop] 
epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=85.09306099130757, CurrSamplesPerSec=84.83410107956414, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:50,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=10, lr=[9.596254808407273e-06, 9.596254808407273e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:50,475] [INFO] [timer.py:215:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=85.08968219278204, CurrSamplesPerSec=84.7900478981542, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:09:52,679] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:09:53,374] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:09:57,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=12, lr=[9.595021678684986e-06, 9.595021678684986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:09:57,899] [INFO] [timer.py:215:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=85.10621261054514, CurrSamplesPerSec=84.7901282455026, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:05,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=12, lr=[9.593460712449759e-06, 9.593460712449759e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:05,431] [INFO] [timer.py:215:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=85.1052267670939, CurrSamplesPerSec=84.96269003697161, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:12,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=12, lr=[9.59187802609708e-06, 9.59187802609708e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:12,980] [INFO] [timer.py:215:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=85.10168247194859, 
CurrSamplesPerSec=84.82276185489576, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:20,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=12, lr=[9.590273626836016e-06, 9.590273626836016e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:20,524] [INFO] [timer.py:215:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=85.09897439305959, CurrSamplesPerSec=84.72293074846799, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:28,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=12, lr=[9.588647521974525e-06, 9.588647521974525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:28,074] [INFO] [timer.py:215:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=85.09560349445911, CurrSamplesPerSec=84.89119170984456, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:35,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=12, lr=[9.586999718919445e-06, 9.586999718919445e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:35,615] [INFO] [timer.py:215:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=85.09369456221842, CurrSamplesPerSec=85.26860382527201, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=12, lr=[9.585330225176441e-06, 9.585330225176441e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:43,150] [INFO] [timer.py:215:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=85.09248034910836, CurrSamplesPerSec=84.92696612861383, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:10:50,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=12, lr=[9.583639048349978e-06, 9.583639048349978e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:50,685] [INFO] [timer.py:215:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=85.09128740259379, CurrSamplesPerSec=85.23700664563965, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 17:10:58,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=12, lr=[9.58192619614329e-06, 9.58192619614329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:10:58,226] [INFO] [timer.py:215:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=85.08959586077582, CurrSamplesPerSec=84.81721397719338, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:05,728] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=12, lr=[9.580191676358337e-06, 9.580191676358337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:05,762] [INFO] [timer.py:215:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=85.08835656111171, CurrSamplesPerSec=84.90655060508486, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:09,468] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:11:10,165] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:11:13,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=14, lr=[9.578788465179952e-06, 9.578788465179952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:13,186] [INFO] [timer.py:215:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=85.10266762998549, CurrSamplesPerSec=85.02625551576152, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:20,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=14, lr=[9.57701496373008e-06, 9.57701496373008e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:20,732] [INFO] [timer.py:215:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=85.09993639042285, CurrSamplesPerSec=84.96026986198312, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:28,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=14, lr=[9.575219817072382e-06, 9.575219817072382e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:28,283] [INFO] [timer.py:215:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=85.09664588871922, CurrSamplesPerSec=84.86993471822097, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:35,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=14, lr=[9.573403033383666e-06, 9.573403033383666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:35,831] [INFO] [timer.py:215:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=85.09387182987135, CurrSamplesPerSec=84.99787881676036, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:43,341] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=14, lr=[9.571564620939298e-06, 9.571564620939298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:43,374] [INFO] [timer.py:215:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=85.09165730523463, CurrSamplesPerSec=85.10985330318314, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:11:50,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=14, lr=[9.56970458811316e-06, 9.56970458811316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:50,913] [INFO] [timer.py:215:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=85.09007431542291, CurrSamplesPerSec=84.8816891628242, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:11:58,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=14, lr=[9.567822943377617e-06, 9.567822943377617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:11:58,458] [INFO] [timer.py:215:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=85.08791527690538, CurrSamplesPerSec=84.9645455978416, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:12:05,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=14, lr=[9.565919695303474e-06, 9.565919695303474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:12:05,994] [INFO] [timer.py:215:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=85.08682888650826, CurrSamplesPerSec=84.92793342326996, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:12:13,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=14, lr=[9.563994852559934e-06, 9.563994852559934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:12:13,531] [INFO] [timer.py:215:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=85.08577017249627, CurrSamplesPerSec=85.2403358355879, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:12:21,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=14, lr=[9.562048423914571e-06, 9.562048423914571e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:12:21,066] [INFO] [timer.py:215:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=85.0848209676363, CurrSamplesPerSec=84.61682561748097, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:12:26,294] [INFO] [loss_scaler.py:190:update_scale] 
[deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:12:26,993] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:12:28,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=16, lr=[9.560475745103543e-06, 9.560475745103543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:12:28,506] [INFO] [timer.py:215:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=85.09565303556764, CurrSamplesPerSec=84.72857327280228, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 1/16 *****
ppl: 2.0014777183532715
Beginning of Epoch 2/16, Total Micro Batches 920
[2023-06-29 17:12:53,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=16, lr=[9.55849048424299e-06, 9.55849048424299e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:12:53,973] [INFO] [timer.py:215:stop] epoch=1/micro_step=10/global_step=930, RunningAvgSamplesPerSec=85.09095755760896, CurrSamplesPerSec=84.72835932452263, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:01,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=16, lr=[9.556483662552754e-06, 9.556483662552754e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:01,519] [INFO] [timer.py:215:stop] epoch=1/micro_step=20/global_step=940, RunningAvgSamplesPerSec=85.08870158560205, CurrSamplesPerSec=85.01637267471027, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:09,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=16, lr=[9.554455289173818e-06, 9.554455289173818e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:09,068] [INFO] [timer.py:215:stop] epoch=1/micro_step=30/global_step=950, RunningAvgSamplesPerSec=85.08618260064492, CurrSamplesPerSec=84.71287768621748, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:16,579] [INFO]
[logging.py:96:log_dist] [Rank 0] step=960, skipped=16, lr=[9.552405373345324e-06, 9.552405373345324e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:16,612] [INFO] [timer.py:215:stop] epoch=1/micro_step=40/global_step=960, RunningAvgSamplesPerSec=85.08420867713522, CurrSamplesPerSec=84.99688301490099, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:24,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=16, lr=[9.550333924404544e-06, 9.550333924404544e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:24,155] [INFO] [timer.py:215:stop] epoch=1/micro_step=50/global_step=970, RunningAvgSamplesPerSec=85.08251306745198, CurrSamplesPerSec=85.18288291375833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:31,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=16, lr=[9.548240951786835e-06, 9.548240951786835e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:31,699] [INFO] [timer.py:215:stop] epoch=1/micro_step=60/global_step=980, RunningAvgSamplesPerSec=85.08065791901474, CurrSamplesPerSec=85.09112994590264, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:39,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=16, lr=[9.546126465025589e-06, 9.546126465025589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:39,246] [INFO] [timer.py:215:stop] epoch=1/micro_step=70/global_step=990, RunningAvgSamplesPerSec=85.07855393956544, CurrSamplesPerSec=84.96970932422552, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:46,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=16, lr=[9.543990473752193e-06, 9.543990473752193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:46,768] [INFO] [timer.py:215:stop] epoch=1/micro_step=80/global_step=1000, RunningAvgSamplesPerSec=85.07932637037337, CurrSamplesPerSec=85.087407839433, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:13:54,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=16, 
lr=[9.54183298769599e-06, 9.54183298769599e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:13:54,312] [INFO] [timer.py:215:stop] epoch=1/micro_step=90/global_step=1010, RunningAvgSamplesPerSec=85.07759351713366, CurrSamplesPerSec=84.84343207669541, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:01,041] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:14:01,739] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:14:01,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=18, lr=[9.540091529208031e-06, 9.540091529208031e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:01,740] [INFO] [timer.py:215:stop] epoch=1/micro_step=100/global_step=1020, RunningAvgSamplesPerSec=85.08872075594867, CurrSamplesPerSec=91.7424755644886, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:09,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=18, lr=[9.53789537737321e-06, 9.53789537737321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:09,276] [INFO] [timer.py:215:stop] epoch=1/micro_step=110/global_step=1030, RunningAvgSamplesPerSec=85.08778075954648, CurrSamplesPerSec=85.09590442874556, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:16,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=18, lr=[9.535677758518463e-06, 9.535677758518463e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:16,809] [INFO] [timer.py:215:stop] epoch=1/micro_step=120/global_step=1040, RunningAvgSamplesPerSec=85.08718092837732, CurrSamplesPerSec=84.93607569323395, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:24,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=18, lr=[9.53343868274494e-06, 9.53343868274494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
17:14:24,336] [INFO] [timer.py:215:stop] epoch=1/micro_step=130/global_step=1050, RunningAvgSamplesPerSec=85.0872827402689, CurrSamplesPerSec=84.98692627946367, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:31,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=18, lr=[9.531178160251531e-06, 9.531178160251531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:31,878] [INFO] [timer.py:215:stop] epoch=1/micro_step=140/global_step=1060, RunningAvgSamplesPerSec=85.08574703823578, CurrSamplesPerSec=84.61629215873862, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:39,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=18, lr=[9.528896201334807e-06, 9.528896201334807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:39,411] [INFO] [timer.py:215:stop] epoch=1/micro_step=150/global_step=1070, RunningAvgSamplesPerSec=85.08519812535565, CurrSamplesPerSec=85.09711836574247, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:46,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=18, lr=[9.526592816388989e-06, 9.526592816388989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:46,946] [INFO] [timer.py:215:stop] epoch=1/micro_step=160/global_step=1080, RunningAvgSamplesPerSec=85.08443232004264, CurrSamplesPerSec=84.81847357656336, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:14:54,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=18, lr=[9.524268015905887e-06, 9.524268015905887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:14:54,474] [INFO] [timer.py:215:stop] epoch=1/micro_step=170/global_step=1090, RunningAvgSamplesPerSec=85.08449102374495, CurrSamplesPerSec=84.92290910494408, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:01,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=18, lr=[9.521921810474856e-06, 9.521921810474856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:02,006] [INFO] [timer.py:215:stop] 
epoch=1/micro_step=180/global_step=1100, RunningAvgSamplesPerSec=85.08406427886868, CurrSamplesPerSec=84.77165236465565, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:09,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=18, lr=[9.519554210782758e-06, 9.519554210782758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:09,551] [INFO] [timer.py:215:stop] epoch=1/micro_step=190/global_step=1110, RunningAvgSamplesPerSec=85.08243361127022, CurrSamplesPerSec=85.24447738604583, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:17,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=18, lr=[9.517165227613896e-06, 9.517165227613896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:17,079] [INFO] [timer.py:215:stop] epoch=1/micro_step=200/global_step=1120, RunningAvgSamplesPerSec=85.08242850777373, CurrSamplesPerSec=84.86593680219711, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:17,776] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:15:18,474] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:15:24,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=20, lr=[9.515238652284776e-06, 9.515238652284776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:24,505] [INFO] [timer.py:215:stop] epoch=1/micro_step=210/global_step=1130, RunningAvgSamplesPerSec=85.09264375880699, CurrSamplesPerSec=84.80290541112603, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:31,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=20, lr=[9.512811206345068e-06, 9.512811206345068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:32,032] [INFO] [timer.py:215:stop] epoch=1/micro_step=220/global_step=1140, RunningAvgSamplesPerSec=85.09264989256897, CurrSamplesPerSec=84.96868728689414, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:39,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=20, lr=[9.51036240764267e-06, 9.51036240764267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:39,566] [INFO] [timer.py:215:stop] epoch=1/micro_step=230/global_step=1150, RunningAvgSamplesPerSec=85.09206832411785, CurrSamplesPerSec=84.96317408851525, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:47,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=20, lr=[9.507892267331749e-06, 9.507892267331749e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:47,112] [INFO] [timer.py:215:stop] epoch=1/micro_step=240/global_step=1160, RunningAvgSamplesPerSec=85.09014723753654, CurrSamplesPerSec=84.82383399223225, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:15:54,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=20, lr=[9.505400796663676e-06, 9.505400796663676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:15:54,651] [INFO] [timer.py:215:stop] epoch=1/micro_step=250/global_step=1170, RunningAvgSamplesPerSec=85.08909425157331, CurrSamplesPerSec=85.05936765255747, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:16:02,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=20, lr=[9.502888006986986e-06, 9.502888006986986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:02,179] [INFO] [timer.py:215:stop] epoch=1/micro_step=260/global_step=1180, RunningAvgSamplesPerSec=85.08898750373706, CurrSamplesPerSec=84.77339250265119, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:09,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=20, lr=[9.500353909747319e-06, 9.500353909747319e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:09,709] [INFO] [timer.py:215:stop] epoch=1/micro_step=270/global_step=1190, RunningAvgSamplesPerSec=85.0888007532895, CurrSamplesPerSec=84.86963955718063, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:17,205] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=20, lr=[9.497798516487371e-06, 9.497798516487371e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:17,238] [INFO] [timer.py:215:stop] epoch=1/micro_step=280/global_step=1200, RunningAvgSamplesPerSec=85.0886610230178, CurrSamplesPerSec=84.70119667235159, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:24,735] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=20, lr=[9.49522183884684e-06, 9.49522183884684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:24,769] [INFO] [timer.py:215:stop] epoch=1/micro_step=290/global_step=1210, RunningAvgSamplesPerSec=85.0883315257424, CurrSamplesPerSec=85.14627142530432, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:32,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=20, lr=[9.492623888562372e-06, 9.492623888562372e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:32,307] [INFO] [timer.py:215:stop] epoch=1/micro_step=300/global_step=1220, RunningAvgSamplesPerSec=85.08733259637978, CurrSamplesPerSec=85.22068925669056, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:34,516] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:16:35,213] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:16:39,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=22, lr=[9.490530219980049e-06, 9.490530219980049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:39,733] [INFO] [timer.py:215:stop] epoch=1/micro_step=310/global_step=1230, RunningAvgSamplesPerSec=85.09677693971543, CurrSamplesPerSec=85.25766264784724, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:47,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=22, lr=[9.487894008822105e-06, 9.487894008822105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:47,270] [INFO] [timer.py:215:stop] epoch=1/micro_step=320/global_step=1240, RunningAvgSamplesPerSec=85.09591462645653, CurrSamplesPerSec=85.25701276526547, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:16:54,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=22, lr=[9.485236558398151e-06, 9.485236558398151e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:16:54,814] [INFO] [timer.py:215:stop] epoch=1/micro_step=330/global_step=1250, RunningAvgSamplesPerSec=85.09429956035969, CurrSamplesPerSec=84.88700388455375, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=22, lr=[9.482557880812749e-06, 9.482557880812749e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:02,360] [INFO] [timer.py:215:stop] epoch=1/micro_step=340/global_step=1260, RunningAvgSamplesPerSec=85.09262107532282, CurrSamplesPerSec=85.02326618522741, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:09,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=22, lr=[9.479857988267154e-06, 
9.479857988267154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:09,896] [INFO] [timer.py:215:stop] epoch=1/micro_step=350/global_step=1270, RunningAvgSamplesPerSec=85.09182188893793, CurrSamplesPerSec=84.86438066827247, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:17,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=22, lr=[9.477136893059248e-06, 9.477136893059248e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:17,432] [INFO] [timer.py:215:stop] epoch=1/micro_step=360/global_step=1280, RunningAvgSamplesPerSec=85.09108781913172, CurrSamplesPerSec=84.75642251368743, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:24,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=22, lr=[9.474394607583496e-06, 9.474394607583496e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:24,963] [INFO] [timer.py:215:stop] epoch=1/micro_step=370/global_step=1290, RunningAvgSamplesPerSec=85.09073645235827, CurrSamplesPerSec=85.20881369382523, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=22, lr=[9.47163114433088e-06, 9.47163114433088e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:32,492] [INFO] [timer.py:215:stop] epoch=1/micro_step=380/global_step=1300, RunningAvgSamplesPerSec=85.09053883060427, CurrSamplesPerSec=85.2230972345202, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:39,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=22, lr=[9.468846515888848e-06, 9.468846515888848e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:40,024] [INFO] [timer.py:215:stop] epoch=1/micro_step=390/global_step=1310, RunningAvgSamplesPerSec=85.09023304237647, CurrSamplesPerSec=84.89621227574975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:47,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=22, lr=[9.466040734941254e-06, 9.466040734941254e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:17:47,564] [INFO] [timer.py:215:stop] epoch=1/micro_step=400/global_step=1320, RunningAvgSamplesPerSec=85.08912660570299, CurrSamplesPerSec=84.88238701761931, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:17:51,281] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:17:51,985] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:17:54,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=24, lr=[9.463780888964232e-06, 9.463780888964232e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:17:54,998] [INFO] [timer.py:215:stop] epoch=1/micro_step=410/global_step=1330, RunningAvgSamplesPerSec=85.09709451731601, CurrSamplesPerSec=85.00159309437446, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:02,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=24, lr=[9.460937065777442e-06, 9.460937065777442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:02,544] [INFO] [timer.py:215:stop] epoch=1/micro_step=420/global_step=1340, RunningAvgSamplesPerSec=85.09541950880275, CurrSamplesPerSec=84.99475692145577, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:10,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=24, lr=[9.458072126112267e-06, 9.458072126112267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:10,071] [INFO] [timer.py:215:stop] epoch=1/micro_step=430/global_step=1350, RunningAvgSamplesPerSec=85.09543433264163, CurrSamplesPerSec=85.08109718565831, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:17,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=24, lr=[9.455186083018376e-06, 9.455186083018376e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:17,614] [INFO] [timer.py:215:stop] 
epoch=1/micro_step=440/global_step=1360, RunningAvgSamplesPerSec=85.0940345647251, CurrSamplesPerSec=85.00350418976457, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:25,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=24, lr=[9.45227894964156e-06, 9.45227894964156e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:25,159] [INFO] [timer.py:215:stop] epoch=1/micro_step=450/global_step=1370, RunningAvgSamplesPerSec=85.0926399264255, CurrSamplesPerSec=84.99341134765471, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:32,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=24, lr=[9.449350739223678e-06, 9.449350739223678e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:32,702] [INFO] [timer.py:215:stop] epoch=1/micro_step=460/global_step=1380, RunningAvgSamplesPerSec=85.09132159532479, CurrSamplesPerSec=84.59570579079447, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:40,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=24, lr=[9.446401465102589e-06, 9.446401465102589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:40,245] [INFO] [timer.py:215:stop] epoch=1/micro_step=470/global_step=1390, RunningAvgSamplesPerSec=85.08994690241602, CurrSamplesPerSec=85.05621428800019, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:47,749] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=24, lr=[9.443431140712103e-06, 9.443431140712103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:47,783] [INFO] [timer.py:215:stop] epoch=1/micro_step=480/global_step=1400, RunningAvgSamplesPerSec=85.08922139090708, CurrSamplesPerSec=84.8258711278423, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:18:55,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=24, lr=[9.440439779581911e-06, 9.440439779581911e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:18:55,319] [INFO] [timer.py:215:stop] epoch=1/micro_step=490/global_step=1410, 
RunningAvgSamplesPerSec=85.08850431980501, CurrSamplesPerSec=85.09228979924485, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:02,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=24, lr=[9.437427395337521e-06, 9.437427395337521e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:02,863] [INFO] [timer.py:215:stop] epoch=1/micro_step=500/global_step=1420, RunningAvgSamplesPerSec=85.08722979196028, CurrSamplesPerSec=84.92371510528169, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:08,077] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:19:08,774] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:19:10,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=26, lr=[9.435002360517267e-06, 9.435002360517267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:10,284] [INFO] [timer.py:215:stop] epoch=1/micro_step=510/global_step=1430, RunningAvgSamplesPerSec=85.09563194942102, CurrSamplesPerSec=84.87709971141172, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:17,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=26, lr=[9.431952169309237e-06, 9.431952169309237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:17,813] [INFO] [timer.py:215:stop] epoch=1/micro_step=520/global_step=1440, RunningAvgSamplesPerSec=85.09549870274256, CurrSamplesPerSec=85.1410052035724, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:25,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=26, lr=[9.428880993647682e-06, 9.428880993647682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:25,355] [INFO] [timer.py:215:stop] epoch=1/micro_step=530/global_step=1450, RunningAvgSamplesPerSec=85.09433066575819, CurrSamplesPerSec=84.88402433737386, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:32,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=26, lr=[9.425788847521664e-06, 9.425788847521664e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:32,895] [INFO] [timer.py:215:stop] epoch=1/micro_step=540/global_step=1460, RunningAvgSamplesPerSec=85.0932779029051, CurrSamplesPerSec=84.97135001451349, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:40,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=26, lr=[9.422675745015768e-06, 9.422675745015768e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:40,428] [INFO] [timer.py:215:stop] epoch=1/micro_step=550/global_step=1470, RunningAvgSamplesPerSec=85.09279919733419, CurrSamplesPerSec=85.11066285600506, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:47,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=26, lr=[9.419541700310026e-06, 9.419541700310026e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:47,967] [INFO] [timer.py:215:stop] epoch=1/micro_step=560/global_step=1480, RunningAvgSamplesPerSec=85.09191804559595, CurrSamplesPerSec=84.21042170233791, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:19:55,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=26, lr=[9.416386727679873e-06, 9.416386727679873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:19:55,503] [INFO] [timer.py:215:stop] epoch=1/micro_step=570/global_step=1490, RunningAvgSamplesPerSec=85.0912304420716, CurrSamplesPerSec=85.00735355823106, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:03,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=26, lr=[9.413210841496058e-06, 9.413210841496058e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:03,044] [INFO] [timer.py:215:stop] epoch=1/micro_step=580/global_step=1500, RunningAvgSamplesPerSec=85.09024119776028, CurrSamplesPerSec=84.80879974118409, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:20:10,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=26, lr=[9.410014056224598e-06, 9.410014056224598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:10,583] [INFO] [timer.py:215:stop] epoch=1/micro_step=590/global_step=1510, RunningAvgSamplesPerSec=85.08935177061166, CurrSamplesPerSec=84.92417184559763, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:18,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=26, lr=[9.406796386426702e-06, 9.406796386426702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:18,136] [INFO] [timer.py:215:stop] epoch=1/micro_step=600/global_step=1520, RunningAvgSamplesPerSec=85.08747343597769, CurrSamplesPerSec=84.99306150544498, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:24,870] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:20:25,568] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:20:25,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=28, lr=[9.404207223575212e-06, 9.404207223575212e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:25,569] [INFO] [timer.py:215:stop] epoch=1/micro_step=610/global_step=1530, RunningAvgSamplesPerSec=85.09454832411798, CurrSamplesPerSec=91.7359542747785, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:33,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=28, lr=[9.40095199862758e-06, 9.40095199862758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:33,114] [INFO] [timer.py:215:stop] epoch=1/micro_step=620/global_step=1540, RunningAvgSamplesPerSec=85.09323520465328, CurrSamplesPerSec=84.75966074090493, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:40,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=28, lr=[9.397675930430762e-06, 9.397675930430762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:40,666] [INFO] [timer.py:215:stop] epoch=1/micro_step=630/global_step=1550, RunningAvgSamplesPerSec=85.09138061401002, CurrSamplesPerSec=84.50824554902799, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:48,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=28, lr=[9.3943790339071e-06, 9.3943790339071e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:48,199] [INFO] [timer.py:215:stop] epoch=1/micro_step=640/global_step=1560, RunningAvgSamplesPerSec=85.09095276954959, CurrSamplesPerSec=85.20286363153042, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:20:55,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=28, lr=[9.391061324073802e-06, 9.391061324073802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:20:55,750] [INFO] [timer.py:215:stop] epoch=1/micro_step=650/global_step=1570, RunningAvgSamplesPerSec=85.08927638617457, CurrSamplesPerSec=85.06796647547917, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:21:03,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=28, lr=[9.387722816042882e-06, 9.387722816042882e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:03,293] [INFO] [timer.py:215:stop] epoch=1/micro_step=660/global_step=1580, RunningAvgSamplesPerSec=85.08814365794275, CurrSamplesPerSec=84.823029886689, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:10,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=28, lr=[9.384363525021092e-06, 9.384363525021092e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:10,839] [INFO] [timer.py:215:stop] epoch=1/micro_step=670/global_step=1590, RunningAvgSamplesPerSec=85.08680722118055, CurrSamplesPerSec=84.5382929299546, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:18,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=28, lr=[9.380983466309844e-06, 9.380983466309844e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:18,382] [INFO] [timer.py:215:stop] epoch=1/micro_step=680/global_step=1600, RunningAvgSamplesPerSec=85.08573309564304, CurrSamplesPerSec=85.06376119126773, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:25,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=28, lr=[9.377582655305148e-06, 9.377582655305148e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:25,934] [INFO] [timer.py:215:stop] epoch=1/micro_step=690/global_step=1610, RunningAvgSamplesPerSec=85.08408907477278, CurrSamplesPerSec=84.9530908207081, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:33,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=28, lr=[9.374161107497545e-06, 9.374161107497545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:33,477] [INFO] [timer.py:215:stop] epoch=1/micro_step=700/global_step=1620, RunningAvgSamplesPerSec=85.08299387365182, CurrSamplesPerSec=84.99298077303634, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:40,990] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=1630, skipped=28, lr=[9.370718838472023e-06, 9.370718838472023e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:41,024] [INFO] [timer.py:215:stop] epoch=1/micro_step=710/global_step=1630, RunningAvgSamplesPerSec=85.08168634712847, CurrSamplesPerSec=84.943009945943, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:41,719] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:21:42,415] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:21:48,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=30, lr=[9.367950114508076e-06, 9.367950114508076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:48,451] [INFO] [timer.py:215:stop] epoch=1/micro_step=720/global_step=1640, RunningAvgSamplesPerSec=85.08863375182821, CurrSamplesPerSec=84.72868024734728, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:21:55,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=30, lr=[9.36447058686571e-06, 9.36447058686571e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:21:55,994] [INFO] [timer.py:215:stop] epoch=1/micro_step=730/global_step=1650, RunningAvgSamplesPerSec=85.08759491569347, CurrSamplesPerSec=84.85126311796688, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:03,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=30, lr=[9.360970382145298e-06, 9.360970382145298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:03,539] [INFO] [timer.py:215:stop] epoch=1/micro_step=740/global_step=1660, RunningAvgSamplesPerSec=85.08643279324464, CurrSamplesPerSec=84.84431701918449, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:11,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=30, lr=[9.357449516290109e-06, 
9.357449516290109e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:11,085] [INFO] [timer.py:215:stop] epoch=1/micro_step=750/global_step=1670, RunningAvgSamplesPerSec=85.08521173204242, CurrSamplesPerSec=84.82461130874924, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:18,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=30, lr=[9.353908005337526e-06, 9.353908005337526e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:18,628] [INFO] [timer.py:215:stop] epoch=1/micro_step=760/global_step=1680, RunningAvgSamplesPerSec=85.08421033545577, CurrSamplesPerSec=85.02501666859352, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:26,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=30, lr=[9.350345865418965e-06, 9.350345865418965e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:26,189] [INFO] [timer.py:215:stop] epoch=1/micro_step=770/global_step=1690, RunningAvgSamplesPerSec=85.08200947507218, CurrSamplesPerSec=84.79952993653511, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:33,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=30, lr=[9.346763112759811e-06, 9.346763112759811e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:33,739] [INFO] [timer.py:215:stop] epoch=1/micro_step=780/global_step=1700, RunningAvgSamplesPerSec=85.08061652212638, CurrSamplesPerSec=85.07610865611132, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:41,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=30, lr=[9.343159763679335e-06, 9.343159763679335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:41,282] [INFO] [timer.py:215:stop] epoch=1/micro_step=790/global_step=1710, RunningAvgSamplesPerSec=85.07965528611061, CurrSamplesPerSec=84.9208941710329, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:48,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=30, lr=[9.339535834590625e-06, 9.339535834590625e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:22:48,825] [INFO] [timer.py:215:stop] epoch=1/micro_step=800/global_step=1720, RunningAvgSamplesPerSec=85.07866669747025, CurrSamplesPerSec=84.67536020127615, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:56,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=30, lr=[9.335891342000508e-06, 9.335891342000508e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:22:56,369] [INFO] [timer.py:215:stop] epoch=1/micro_step=810/global_step=1730, RunningAvgSamplesPerSec=85.07769002593274, CurrSamplesPerSec=84.46724101419045, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:22:58,575] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:22:59,273] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:23:03,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=32, lr=[9.33296095335979e-06, 9.33296095335979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:03,800] [INFO] [timer.py:215:stop] epoch=1/micro_step=820/global_step=1740, RunningAvgSamplesPerSec=85.0840423109208, CurrSamplesPerSec=84.87071287989566, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:11,313] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=32, lr=[9.329279488363285e-06, 9.329279488363285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:11,347] [INFO] [timer.py:215:stop] epoch=1/micro_step=830/global_step=1750, RunningAvgSamplesPerSec=85.08285665422115, CurrSamplesPerSec=85.15839971575318, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:18,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=32, lr=[9.325577506582558e-06, 9.325577506582558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:18,894] [INFO] [timer.py:215:stop] 
epoch=1/micro_step=840/global_step=1760, RunningAvgSamplesPerSec=85.08166146332411, CurrSamplesPerSec=85.19331822229726, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:26,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=32, lr=[9.321855024879961e-06, 9.321855024879961e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:26,431] [INFO] [timer.py:215:stop] epoch=1/micro_step=850/global_step=1770, RunningAvgSamplesPerSec=85.08115459650242, CurrSamplesPerSec=84.82064446327695, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:33,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=32, lr=[9.318112060211228e-06, 9.318112060211228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:33,984] [INFO] [timer.py:215:stop] epoch=1/micro_step=860/global_step=1780, RunningAvgSamplesPerSec=85.07960185450301, CurrSamplesPerSec=85.04629753718304, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:41,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=32, lr=[9.314348629625388e-06, 9.314348629625388e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:41,536] [INFO] [timer.py:215:stop] epoch=1/micro_step=870/global_step=1790, RunningAvgSamplesPerSec=85.0781288209538, CurrSamplesPerSec=84.98291733400364, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:49,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=32, lr=[9.310564750264693e-06, 9.310564750264693e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:49,082] [INFO] [timer.py:215:stop] epoch=1/micro_step=880/global_step=1800, RunningAvgSamplesPerSec=85.07705570137175, CurrSamplesPerSec=84.81595441523427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:23:56,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=32, lr=[9.30676043936454e-06, 9.30676043936454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:23:56,631] [INFO] [timer.py:215:stop] epoch=1/micro_step=890/global_step=1810, 
RunningAvgSamplesPerSec=85.07574611268245, CurrSamplesPerSec=84.45852403996966, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:24:04,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=32, lr=[9.302935714253385e-06, 9.302935714253385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:04,178] [INFO] [timer.py:215:stop] epoch=1/micro_step=900/global_step=1820, RunningAvgSamplesPerSec=85.07463165457236, CurrSamplesPerSec=85.001996839762, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:24:11,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=32, lr=[9.29909059235268e-06, 9.29909059235268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:11,712] [INFO] [timer.py:215:stop] epoch=1/micro_step=910/global_step=1830, RunningAvgSamplesPerSec=85.07434168749354, CurrSamplesPerSec=85.1502687874446, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:24:15,427] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:24:16,123] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:24:19,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=34, lr=[9.295999820910157e-06, 9.295999820910157e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:19,142] [INFO] [timer.py:215:stop] epoch=1/micro_step=920/global_step=1840, RunningAvgSamplesPerSec=85.08041401757053, CurrSamplesPerSec=84.78937834283929, MemAllocated=5.0GB, MaxMemAllocated=23.99GB ***** Evaluating perplexity, Epoch 2/16 ***** ppl: 1.9457812309265137 Beginning of Epoch 3/16, Total Micro Batches 920 [2023-06-29 17:24:44,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=34, lr=[9.2921180289868e-06, 9.2921180289868e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:44,627] [INFO] [timer.py:215:stop] epoch=2/micro_step=10/global_step=1850, RunningAvgSamplesPerSec=85.07775223230811, CurrSamplesPerSec=84.55829199690918, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:24:52,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=34, lr=[9.288215889547945e-06, 9.288215889547945e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:52,164] [INFO] [timer.py:215:stop] epoch=2/micro_step=20/global_step=1860, RunningAvgSamplesPerSec=85.07724315304776, CurrSamplesPerSec=84.98633433083052, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:24:59,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=34, lr=[9.284293420367653e-06, 9.284293420367653e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:24:59,706] [INFO] [timer.py:215:stop] epoch=2/micro_step=30/global_step=1870, RunningAvgSamplesPerSec=85.07640929668239, CurrSamplesPerSec=84.57190529449757, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:07,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=34, lr=[9.280350639312594e-06, 9.280350639312594e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:07,249] [INFO] [timer.py:215:stop] epoch=2/micro_step=40/global_step=1880, 
RunningAvgSamplesPerSec=85.07555985737982, CurrSamplesPerSec=84.98149142446131, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:14,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=34, lr=[9.276387564341946e-06, 9.276387564341946e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:14,792] [INFO] [timer.py:215:stop] epoch=2/micro_step=50/global_step=1890, RunningAvgSamplesPerSec=85.0747535789975, CurrSamplesPerSec=85.2430968245107, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:22,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=34, lr=[9.272404213507338e-06, 9.272404213507338e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:22,335] [INFO] [timer.py:215:stop] epoch=2/micro_step=60/global_step=1900, RunningAvgSamplesPerSec=85.07391208716967, CurrSamplesPerSec=84.90958544986172, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:29,835] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=34, lr=[9.268400604952746e-06, 9.268400604952746e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:29,868] [INFO] [timer.py:215:stop] epoch=2/micro_step=70/global_step=1910, RunningAvgSamplesPerSec=85.07364841195785, CurrSamplesPerSec=85.11136448090828, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:37,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=34, lr=[9.264376756914422e-06, 9.264376756914422e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:37,396] [INFO] [timer.py:215:stop] epoch=2/micro_step=80/global_step=1920, RunningAvgSamplesPerSec=85.07371648392817, CurrSamplesPerSec=85.13047471413385, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:44,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=34, lr=[9.260332687720804e-06, 9.260332687720804e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:44,934] [INFO] [timer.py:215:stop] epoch=2/micro_step=90/global_step=1930, RunningAvgSamplesPerSec=85.0731986756574, 
CurrSamplesPerSec=85.17636890077026, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:50,159] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:25:50,857] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:25:52,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=36, lr=[9.257082885509618e-06, 9.257082885509618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:52,369] [INFO] [timer.py:215:stop] epoch=2/micro_step=100/global_step=1940, RunningAvgSamplesPerSec=85.07874243164554, CurrSamplesPerSec=84.63072458796839, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:25:59,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=36, lr=[9.253002464718097e-06, 9.253002464718097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:25:59,903] [INFO] [timer.py:215:stop] epoch=2/micro_step=110/global_step=1950, RunningAvgSamplesPerSec=85.07840407114402, CurrSamplesPerSec=85.07551546448931, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:07,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=36, lr=[9.248901874580661e-06, 9.248901874580661e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:07,443] [INFO] [timer.py:215:stop] epoch=2/micro_step=120/global_step=1960, RunningAvgSamplesPerSec=85.07774626927372, CurrSamplesPerSec=84.76565612867267, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:14,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=36, lr=[9.244781133775306e-06, 9.244781133775306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:14,979] [INFO] [timer.py:215:stop] epoch=2/micro_step=130/global_step=1970, RunningAvgSamplesPerSec=85.07734871857056, CurrSamplesPerSec=85.18463997583157, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:26:22,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=36, lr=[9.240640261071813e-06, 9.240640261071813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:22,511] [INFO] [timer.py:215:stop] epoch=2/micro_step=140/global_step=1980, RunningAvgSamplesPerSec=85.07715021025163, CurrSamplesPerSec=84.87830741366575, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:30,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=36, lr=[9.236479275331666e-06, 9.236479275331666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:30,035] [INFO] [timer.py:215:stop] epoch=2/micro_step=150/global_step=1990, RunningAvgSamplesPerSec=85.07745836876848, CurrSamplesPerSec=84.9922541882602, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:37,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=36, lr=[9.232298195507963e-06, 9.232298195507963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:37,576] [INFO] [timer.py:215:stop] epoch=2/micro_step=160/global_step=2000, RunningAvgSamplesPerSec=85.07671161318453, CurrSamplesPerSec=84.79457436945525, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:45,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=36, lr=[9.228097040645329e-06, 9.228097040645329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:45,121] [INFO] [timer.py:215:stop] epoch=2/micro_step=170/global_step=2010, RunningAvgSamplesPerSec=85.07581524168253, CurrSamplesPerSec=85.09622814189159, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:26:52,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=36, lr=[9.223875829879829e-06, 9.223875829879829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:26:52,667] [INFO] [timer.py:215:stop] epoch=2/micro_step=180/global_step=2020, RunningAvgSamplesPerSec=85.07484785990016, CurrSamplesPerSec=84.91495738838115, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:00,201] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=2030, skipped=36, lr=[9.219634582438881e-06, 9.219634582438881e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:00,235] [INFO] [timer.py:215:stop] epoch=2/micro_step=190/global_step=2030, RunningAvgSamplesPerSec=85.07268075402732, CurrSamplesPerSec=84.97543856761223, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:06,967] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:27:07,664] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:27:07,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=38, lr=[9.216227171058895e-06, 9.216227171058895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:07,666] [INFO] [timer.py:215:stop] epoch=2/micro_step=200/global_step=2040, RunningAvgSamplesPerSec=85.07809910227364, CurrSamplesPerSec=91.82177708453055, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:15,189] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=38, lr=[9.211949906346505e-06, 9.211949906346505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:15,223] [INFO] [timer.py:215:stop] epoch=2/micro_step=210/global_step=2050, RunningAvgSamplesPerSec=85.0765356364953, CurrSamplesPerSec=84.91353375847758, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:22,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=38, lr=[9.2076526592807e-06, 9.2076526592807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:22,767] [INFO] [timer.py:215:stop] epoch=2/micro_step=220/global_step=2060, RunningAvgSamplesPerSec=85.07569722320419, CurrSamplesPerSec=85.17282850666443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:30,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=38, lr=[9.203335449435236e-06, 9.203335449435236e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 17:27:30,311] [INFO] [timer.py:215:stop] epoch=2/micro_step=230/global_step=2070, RunningAvgSamplesPerSec=85.07481496506858, CurrSamplesPerSec=84.8199476233329, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:37,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=38, lr=[9.198998296474807e-06, 9.198998296474807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:37,857] [INFO] [timer.py:215:stop] epoch=2/micro_step=240/global_step=2080, RunningAvgSamplesPerSec=85.07391133494778, CurrSamplesPerSec=85.02663256336957, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:45,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=38, lr=[9.194641220154943e-06, 9.194641220154943e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:45,407] [INFO] [timer.py:215:stop] epoch=2/micro_step=250/global_step=2090, RunningAvgSamplesPerSec=85.07273782977508, CurrSamplesPerSec=85.15718402974397, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:27:52,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=38, lr=[9.190264240321921e-06, 9.190264240321921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:27:52,943] [INFO] [timer.py:215:stop] epoch=2/micro_step=260/global_step=2100, RunningAvgSamplesPerSec=85.07233347887112, CurrSamplesPerSec=85.0480489690395, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:00,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=38, lr=[9.185867376912686e-06, 9.185867376912686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:00,487] [INFO] [timer.py:215:stop] epoch=2/micro_step=270/global_step=2110, RunningAvgSamplesPerSec=85.07153497846826, CurrSamplesPerSec=84.63843635062295, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:08,003] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=38, lr=[9.181450649954749e-06, 9.181450649954749e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:08,036] [INFO] 
[timer.py:215:stop] epoch=2/micro_step=280/global_step=2120, RunningAvgSamplesPerSec=85.07048205803103, CurrSamplesPerSec=84.46035764425743, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:15,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=38, lr=[9.17701407956609e-06, 9.17701407956609e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:15,589] [INFO] [timer.py:215:stop] epoch=2/micro_step=290/global_step=2130, RunningAvgSamplesPerSec=85.06920456023393, CurrSamplesPerSec=84.99470309768571, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:23,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=38, lr=[9.172557685955084e-06, 9.172557685955084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:23,142] [INFO] [timer.py:215:stop] epoch=2/micro_step=300/global_step=2140, RunningAvgSamplesPerSec=85.06794679265896, CurrSamplesPerSec=84.93137287161743, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:23,838] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:28:24,537] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:28:30,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=40, lr=[9.16897831198386e-06, 9.16897831198386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:30,571] [INFO] [timer.py:215:stop] epoch=2/micro_step=310/global_step=2150, RunningAvgSamplesPerSec=85.07323678228964, CurrSamplesPerSec=84.58520313680165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:38,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=40, lr=[9.164486287785888e-06, 9.164486287785888e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:38,117] [INFO] [timer.py:215:stop] epoch=2/micro_step=320/global_step=2160, RunningAvgSamplesPerSec=85.07234316173181, CurrSamplesPerSec=85.18258557196327, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:45,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=40, lr=[9.15997449742908e-06, 9.15997449742908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:45,674] [INFO] [timer.py:215:stop] epoch=2/micro_step=330/global_step=2170, RunningAvgSamplesPerSec=85.07085942888024, CurrSamplesPerSec=84.63518069462937, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:28:53,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=40, lr=[9.15544296146443e-06, 9.15544296146443e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:28:53,221] [INFO] [timer.py:215:stop] epoch=2/micro_step=340/global_step=2180, RunningAvgSamplesPerSec=85.06991171230304, CurrSamplesPerSec=84.99933219108138, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:00,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=40, lr=[9.15089170053288e-06, 9.15089170053288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:00,767] [INFO] [timer.py:215:stop] epoch=2/micro_step=350/global_step=2190, RunningAvgSamplesPerSec=85.0691047512185, CurrSamplesPerSec=85.18728922183566, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:29:08,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=40, lr=[9.146320735365205e-06, 9.146320735365205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:08,310] [INFO] [timer.py:215:stop] epoch=2/micro_step=360/global_step=2200, RunningAvgSamplesPerSec=85.06837157527934, CurrSamplesPerSec=84.7933690488224, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:15,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=40, lr=[9.141730086781944e-06, 9.141730086781944e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:15,864] [INFO] [timer.py:215:stop] epoch=2/micro_step=370/global_step=2210, RunningAvgSamplesPerSec=85.06713251062023, CurrSamplesPerSec=85.08713813422594, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:23,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=40, lr=[9.137119775693286e-06, 9.137119775693286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:23,419] [INFO] [timer.py:215:stop] epoch=2/micro_step=380/global_step=2220, RunningAvgSamplesPerSec=85.06581804697849, CurrSamplesPerSec=85.15153829853917, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:30,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=40, lr=[9.132489823098989e-06, 9.132489823098989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:30,972] [INFO] [timer.py:215:stop] epoch=2/micro_step=390/global_step=2230, RunningAvgSamplesPerSec=85.06461330219965, CurrSamplesPerSec=84.66382304444201, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:38,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=40, lr=[9.127840250088267e-06, 9.127840250088267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:38,512] [INFO] [timer.py:215:stop] epoch=2/micro_step=400/global_step=2240, RunningAvgSamplesPerSec=85.064085888457, CurrSamplesPerSec=84.36893395514203, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:40,717] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:29:41,414] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:29:45,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=42, lr=[9.124106479208876e-06, 9.124106479208876e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:45,931] [INFO] [timer.py:215:stop] epoch=2/micro_step=410/global_step=2250, RunningAvgSamplesPerSec=85.06968739131173, CurrSamplesPerSec=85.15645463479876, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:29:53,451] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=42, lr=[9.119421642878632e-06, 9.119421642878632e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:29:53,485] [INFO] [timer.py:215:stop] epoch=2/micro_step=420/global_step=2260, RunningAvgSamplesPerSec=85.06842251225774, CurrSamplesPerSec=85.11131050935118, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:00,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=42, lr=[9.114717245656921e-06, 9.114717245656921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:01,023] [INFO] [timer.py:215:stop] epoch=2/micro_step=430/global_step=2270, RunningAvgSamplesPerSec=85.06797670964329, CurrSamplesPerSec=85.26903719703948, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:08,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=42, lr=[9.109993308972054e-06, 9.109993308972054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:08,570] [INFO] [timer.py:215:stop] epoch=2/micro_step=440/global_step=2280, RunningAvgSamplesPerSec=85.067092020734, CurrSamplesPerSec=85.0055499645646, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:16,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=42, lr=[9.105249854341344e-06, 
9.105249854341344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:16,125] [INFO] [timer.py:215:stop] epoch=2/micro_step=450/global_step=2290, RunningAvgSamplesPerSec=85.06583681267733, CurrSamplesPerSec=84.48382949487295, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:23,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=42, lr=[9.100486903371005e-06, 9.100486903371005e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:23,687] [INFO] [timer.py:215:stop] epoch=2/micro_step=460/global_step=2300, RunningAvgSamplesPerSec=85.06423353888815, CurrSamplesPerSec=84.32983942690929, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:31,199] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=42, lr=[9.095704477756058e-06, 9.095704477756058e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:31,233] [INFO] [timer.py:215:stop] epoch=2/micro_step=470/global_step=2310, RunningAvgSamplesPerSec=85.06343090448166, CurrSamplesPerSec=85.00143159729336, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:38,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=42, lr=[9.090902599280228e-06, 9.090902599280228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:38,787] [INFO] [timer.py:215:stop] epoch=2/micro_step=480/global_step=2320, RunningAvgSamplesPerSec=85.06221017843441, CurrSamplesPerSec=84.56820183045181, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:46,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=42, lr=[9.086081289815856e-06, 9.086081289815856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:30:46,334] [INFO] [timer.py:215:stop] epoch=2/micro_step=490/global_step=2330, RunningAvgSamplesPerSec=85.06141054806871, CurrSamplesPerSec=84.64572245546123, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:53,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=42, lr=[9.081240571323775e-06, 9.081240571323775e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:30:53,880] [INFO] [timer.py:215:stop] epoch=2/micro_step=500/global_step=2340, RunningAvgSamplesPerSec=85.06063485807391, CurrSamplesPerSec=84.68877075531522, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:30:57,587] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:30:58,283] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:31:01,271] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=44, lr=[9.077354036844291e-06, 9.077354036844291e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:01,305] [INFO] [timer.py:215:stop] epoch=2/micro_step=510/global_step=2350, RunningAvgSamplesPerSec=85.06570825110485, CurrSamplesPerSec=84.64932593790196, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:08,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=44, lr=[9.072478437725792e-06, 9.072478437725792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:08,842] [INFO] [timer.py:215:stop] epoch=2/micro_step=520/global_step=2360, RunningAvgSamplesPerSec=85.06529396239388, CurrSamplesPerSec=84.89323208211418, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:16,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=44, lr=[9.067583491539948e-06, 9.067583491539948e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:16,397] [INFO] [timer.py:215:stop] epoch=2/micro_step=530/global_step=2370, RunningAvgSamplesPerSec=85.0640756784287, CurrSamplesPerSec=84.48904132397362, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:23,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=44, lr=[9.062669220583011e-06, 9.062669220583011e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:23,950] [INFO] [timer.py:215:stop] 
epoch=2/micro_step=540/global_step=2380, RunningAvgSamplesPerSec=85.06299968488821, CurrSamplesPerSec=85.00573839614346, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:31,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=44, lr=[9.05773564723926e-06, 9.05773564723926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:31,502] [INFO] [timer.py:215:stop] epoch=2/micro_step=550/global_step=2390, RunningAvgSamplesPerSec=85.0619707347992, CurrSamplesPerSec=85.20037567006291, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:39,007] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=44, lr=[9.05278279398089e-06, 9.05278279398089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:39,041] [INFO] [timer.py:215:stop] epoch=2/micro_step=560/global_step=2400, RunningAvgSamplesPerSec=85.06153049818704, CurrSamplesPerSec=85.01685733813301, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:46,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=44, lr=[9.04781068336792e-06, 9.04781068336792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:46,594] [INFO] [timer.py:215:stop] epoch=2/micro_step=570/global_step=2410, RunningAvgSamplesPerSec=85.06041495314973, CurrSamplesPerSec=84.72560483692791, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:31:54,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=44, lr=[9.04281933804808e-06, 9.04281933804808e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:31:54,147] [INFO] [timer.py:215:stop] epoch=2/micro_step=580/global_step=2420, RunningAvgSamplesPerSec=85.05937693779832, CurrSamplesPerSec=84.25292249033056, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:01,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=44, lr=[9.037808780756722e-06, 9.037808780756722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:01,679] [INFO] [timer.py:215:stop] epoch=2/micro_step=590/global_step=2430, 
RunningAvgSamplesPerSec=85.05929831678404, CurrSamplesPerSec=84.92301657077181, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:09,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=44, lr=[9.032779034316696e-06, 9.032779034316696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:09,217] [INFO] [timer.py:215:stop] epoch=2/micro_step=600/global_step=2440, RunningAvgSamplesPerSec=85.05895303642836, CurrSamplesPerSec=85.3533575623452, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:14,439] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:32:15,136] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:32:16,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=46, lr=[9.028741436370401e-06, 9.028741436370401e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:16,642] [INFO] [timer.py:215:stop] epoch=2/micro_step=610/global_step=2450, RunningAvgSamplesPerSec=85.06381152386592, CurrSamplesPerSec=85.12863889547812, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:24,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=46, lr=[9.023677207255308e-06, 9.023677207255308e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:24,176] [INFO] [timer.py:215:stop] epoch=2/micro_step=620/global_step=2460, RunningAvgSamplesPerSec=85.06363925510783, CurrSamplesPerSec=84.8637367674791, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:31,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=46, lr=[9.018593853360213e-06, 9.018593853360213e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:31,712] [INFO] [timer.py:215:stop] epoch=2/micro_step=630/global_step=2470, RunningAvgSamplesPerSec=85.06337305617565, CurrSamplesPerSec=85.06990751612518, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:39,228] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=46, lr=[9.013491397839557e-06, 9.013491397839557e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:39,261] [INFO] [timer.py:215:stop] epoch=2/micro_step=640/global_step=2480, RunningAvgSamplesPerSec=85.06246624266741, CurrSamplesPerSec=84.9616681684918, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:46,771] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=46, lr=[9.008369863934787e-06, 9.008369863934787e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:46,805] [INFO] [timer.py:215:stop] epoch=2/micro_step=650/global_step=2490, RunningAvgSamplesPerSec=85.06185166482283, CurrSamplesPerSec=84.89586323324492, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:32:54,320] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=46, lr=[9.003229274974254e-06, 9.003229274974254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:32:54,353] [INFO] [timer.py:215:stop] epoch=2/micro_step=660/global_step=2500, RunningAvgSamplesPerSec=85.06101257325541, CurrSamplesPerSec=84.74553211049434, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:01,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=46, lr=[8.998069654373099e-06, 8.998069654373099e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:01,890] [INFO] [timer.py:215:stop] epoch=2/micro_step=670/global_step=2510, RunningAvgSamplesPerSec=85.0606945307051, CurrSamplesPerSec=84.85289923721503, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:09,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=46, lr=[8.99289102563316e-06, 8.99289102563316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:09,430] [INFO] [timer.py:215:stop] epoch=2/micro_step=680/global_step=2520, RunningAvgSamplesPerSec=85.06027490658266, CurrSamplesPerSec=84.97075828284245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:33:16,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=46, lr=[8.987693412342847e-06, 8.987693412342847e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:16,966] [INFO] [timer.py:215:stop] epoch=2/micro_step=690/global_step=2530, RunningAvgSamplesPerSec=85.06004621260097, CurrSamplesPerSec=85.09924958335723, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:24,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=46, lr=[8.982476838177047e-06, 8.982476838177047e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:24,510] [INFO] [timer.py:215:stop] epoch=2/micro_step=700/global_step=2540, RunningAvgSamplesPerSec=85.05943508779207, CurrSamplesPerSec=84.85448177323707, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:31,247] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:33:31,943] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:33:31,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=48, lr=[8.978289942978722e-06, 8.978289942978722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:31,945] [INFO] [timer.py:215:stop] epoch=2/micro_step=710/global_step=2550, RunningAvgSamplesPerSec=85.0636843762906, CurrSamplesPerSec=91.93646809893715, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:39,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=48, lr=[8.973039299173377e-06, 8.973039299173377e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:39,481] [INFO] [timer.py:215:stop] epoch=2/micro_step=720/global_step=2560, RunningAvgSamplesPerSec=85.06340371997422, CurrSamplesPerSec=84.82619278977796, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:46,982] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=48, lr=[8.967769761241352e-06, 8.967769761241352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:47,015] [INFO] [timer.py:215:stop] epoch=2/micro_step=730/global_step=2570, RunningAvgSamplesPerSec=85.06320494138622, CurrSamplesPerSec=84.94233797269041, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:33:54,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=48, lr=[8.962481353185147e-06, 8.962481353185147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:33:54,546] [INFO] [timer.py:215:stop] epoch=2/micro_step=740/global_step=2580, RunningAvgSamplesPerSec=85.06315217433428, CurrSamplesPerSec=85.22171736540938, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:02,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=48, lr=[8.957174099093217e-06, 8.957174099093217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:02,086] [INFO] [timer.py:215:stop] epoch=2/micro_step=750/global_step=2590, RunningAvgSamplesPerSec=85.06272976602807, CurrSamplesPerSec=84.76951075442163, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:34:09,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=48, lr=[8.95184802313986e-06, 8.95184802313986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:09,610] [INFO] [timer.py:215:stop] epoch=2/micro_step=760/global_step=2600, RunningAvgSamplesPerSec=85.06296511570721, CurrSamplesPerSec=85.21571139688959, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:17,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=48, lr=[8.946503149585103e-06, 8.946503149585103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:17,147] [INFO] [timer.py:215:stop] epoch=2/micro_step=770/global_step=2610, RunningAvgSamplesPerSec=85.0626797865722, CurrSamplesPerSec=84.92078671057668, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:24,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=48, lr=[8.941139502774598e-06, 8.941139502774598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:24,680] [INFO] [timer.py:215:stop] epoch=2/micro_step=780/global_step=2620, RunningAvgSamplesPerSec=85.06254791749139, CurrSamplesPerSec=85.39699298494648, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:32,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=48, lr=[8.935757107139506e-06, 8.935757107139506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:32,211] [INFO] [timer.py:215:stop] epoch=2/micro_step=790/global_step=2630, RunningAvgSamplesPerSec=85.06250427596217, CurrSamplesPerSec=85.03401259692961, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:39,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=48, lr=[8.93035598719639e-06, 8.93035598719639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:39,741] [INFO] [timer.py:215:stop] epoch=2/micro_step=800/global_step=2640, RunningAvgSamplesPerSec=85.06247457557697, CurrSamplesPerSec=84.91025690500607, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:47,250] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=2650, skipped=48, lr=[8.924936167547103e-06, 8.924936167547103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:47,284] [INFO] [timer.py:215:stop] epoch=2/micro_step=810/global_step=2650, RunningAvgSamplesPerSec=85.06192079762089, CurrSamplesPerSec=84.73550043214448, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:34:47,980] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:34:48,682] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:34:54,681] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=50, lr=[8.920586864626051e-06, 8.920586864626051e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:34:54,714] [INFO] [timer.py:215:stop] epoch=2/micro_step=820/global_step=2660, RunningAvgSamplesPerSec=85.06616493358614, CurrSamplesPerSec=84.9762186652012, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:02,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=50, lr=[8.915133447774127e-06, 8.915133447774127e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:02,243] [INFO] [timer.py:215:stop] epoch=2/micro_step=830/global_step=2670, RunningAvgSamplesPerSec=85.0662179111143, CurrSamplesPerSec=84.9979326445526, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:09,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=50, lr=[8.909661400553994e-06, 8.909661400553994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:09,776] [INFO] [timer.py:215:stop] epoch=2/micro_step=840/global_step=2680, RunningAvgSamplesPerSec=85.06605666797589, CurrSamplesPerSec=85.13757575815242, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:17,271] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=50, lr=[8.90417074789057e-06, 
8.90417074789057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:17,304] [INFO] [timer.py:215:stop] epoch=2/micro_step=850/global_step=2690, RunningAvgSamplesPerSec=85.06611186027658, CurrSamplesPerSec=85.17642295486975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:24,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=50, lr=[8.898661514793523e-06, 8.898661514793523e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:24,842] [INFO] [timer.py:215:stop] epoch=2/micro_step=860/global_step=2700, RunningAvgSamplesPerSec=85.06574697144137, CurrSamplesPerSec=85.24271785465686, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:32,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=50, lr=[8.893133726357158e-06, 8.893133726357158e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:32,378] [INFO] [timer.py:215:stop] epoch=2/micro_step=870/global_step=2710, RunningAvgSamplesPerSec=85.06551485557443, CurrSamplesPerSec=85.01960386857782, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:39,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=50, lr=[8.887587407760289e-06, 8.887587407760289e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:39,912] [INFO] [timer.py:215:stop] epoch=2/micro_step=880/global_step=2720, RunningAvgSamplesPerSec=85.06533492066929, CurrSamplesPerSec=85.05114783289213, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:47,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=50, lr=[8.882022584266147e-06, 8.882022584266147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:35:47,445] [INFO] [timer.py:215:stop] epoch=2/micro_step=890/global_step=2730, RunningAvgSamplesPerSec=85.06519763966925, CurrSamplesPerSec=84.59479936757667, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:35:54,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=50, lr=[8.876439281222242e-06, 8.876439281222242e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:35:54,978] [INFO] [timer.py:215:stop] epoch=2/micro_step=900/global_step=2740, RunningAvgSamplesPerSec=85.06501669377116, CurrSamplesPerSec=85.01720737626388, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:36:02,473] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=50, lr=[8.870837524060258e-06, 8.870837524060258e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:02,506] [INFO] [timer.py:215:stop] epoch=2/micro_step=910/global_step=2750, RunningAvgSamplesPerSec=85.06512937761634, CurrSamplesPerSec=84.7694572155522, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:36:04,710] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:36:05,408] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:36:09,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=52, lr=[8.866342848509415e-06, 8.866342848509415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:09,936] [INFO] [timer.py:215:stop] epoch=2/micro_step=920/global_step=2760, RunningAvgSamplesPerSec=85.06925978894348, CurrSamplesPerSec=84.81193478684654, MemAllocated=5.0GB, MaxMemAllocated=23.99GB ***** Evaluating perplexity, Epoch 3/16 ***** ppl: 1.9029735326766968 Beginning of Epoch 4/16, Total Micro Batches 920 [2023-06-29 17:36:35,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=52, lr=[8.860707938290982e-06, 8.860707938290982e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:35,387] [INFO] [timer.py:215:stop] epoch=3/micro_step=10/global_step=2770, RunningAvgSamplesPerSec=85.06841146477885, CurrSamplesPerSec=84.61175803098764, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:36:42,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=52, lr=[8.85505464561001e-06, 8.85505464561001e-06], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:42,921] [INFO] [timer.py:215:stop] epoch=3/micro_step=20/global_step=2780, RunningAvgSamplesPerSec=85.06824930548734, CurrSamplesPerSec=85.13114966736956, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:36:50,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=52, lr=[8.849382996216985e-06, 8.849382996216985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:50,455] [INFO] [timer.py:215:stop] epoch=3/micro_step=30/global_step=2790, RunningAvgSamplesPerSec=85.06809276767916, CurrSamplesPerSec=84.75939310944106, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:36:57,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=52, lr=[8.843693015946007e-06, 8.843693015946007e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:36:57,991] [INFO] [timer.py:215:stop] epoch=3/micro_step=40/global_step=2800, RunningAvgSamplesPerSec=85.06781662507427, CurrSamplesPerSec=84.83595102793117, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:05,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=52, lr=[8.837984730714672e-06, 8.837984730714672e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:05,533] [INFO] [timer.py:215:stop] epoch=3/micro_step=50/global_step=2810, RunningAvgSamplesPerSec=85.06729289816477, CurrSamplesPerSec=85.43115584798318, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:13,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=52, lr=[8.832258166523955e-06, 8.832258166523955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:13,072] [INFO] [timer.py:215:stop] epoch=3/micro_step=60/global_step=2820, RunningAvgSamplesPerSec=85.0669348218702, CurrSamplesPerSec=85.03129207051724, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:20,583] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=52, lr=[8.826513349458089e-06, 8.826513349458089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
17:37:20,618] [INFO] [timer.py:215:stop] epoch=3/micro_step=70/global_step=2830, RunningAvgSamplesPerSec=85.06629233043404, CurrSamplesPerSec=84.62240066377568, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:28,102] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=52, lr=[8.820750305684452e-06, 8.820750305684452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:28,136] [INFO] [timer.py:215:stop] epoch=3/micro_step=80/global_step=2840, RunningAvgSamplesPerSec=85.06677211113606, CurrSamplesPerSec=84.99935910588938, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:35,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=52, lr=[8.81496906145344e-06, 8.81496906145344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:35,664] [INFO] [timer.py:215:stop] epoch=3/micro_step=90/global_step=2850, RunningAvgSamplesPerSec=85.0668226096533, CurrSamplesPerSec=84.75535207976812, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:39,377] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:37:40,074] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:37:43,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=54, lr=[8.810330979432513e-06, 8.810330979432513e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:43,091] [INFO] [timer.py:215:stop] epoch=3/micro_step=100/global_step=2860, RunningAvgSamplesPerSec=85.0708726738673, CurrSamplesPerSec=84.7090014777938, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:50,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=54, lr=[8.804517040793774e-06, 8.804517040793774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:50,630] [INFO] [timer.py:215:stop] epoch=3/micro_step=110/global_step=2870, RunningAvgSamplesPerSec=85.07049527140973, CurrSamplesPerSec=85.08044999204456, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:37:58,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=54, lr=[8.798684975639427e-06, 8.798684975639427e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:37:58,156] [INFO] [timer.py:215:stop] epoch=3/micro_step=120/global_step=2880, RunningAvgSamplesPerSec=85.0706222021448, CurrSamplesPerSec=84.94516033179931, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:05,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=54, lr=[8.792834810534262e-06, 8.792834810534262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:05,687] [INFO] [timer.py:215:stop] epoch=3/micro_step=130/global_step=2890, RunningAvgSamplesPerSec=85.07055895555098, CurrSamplesPerSec=84.97409360540445, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:13,199] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=54, lr=[8.786966572125507e-06, 8.786966572125507e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:13,232] [INFO] [timer.py:215:stop] epoch=3/micro_step=140/global_step=2900, RunningAvgSamplesPerSec=85.06993995499947, CurrSamplesPerSec=85.01806902086626, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:38:20,728] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=54, lr=[8.781080287142716e-06, 8.781080287142716e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:20,761] [INFO] [timer.py:215:stop] epoch=3/micro_step=150/global_step=2910, RunningAvgSamplesPerSec=85.06995805133585, CurrSamplesPerSec=85.21219478128373, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:28,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=54, lr=[8.775175982397645e-06, 8.775175982397645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:28,307] [INFO] [timer.py:215:stop] epoch=3/micro_step=160/global_step=2920, RunningAvgSamplesPerSec=85.06934243259526, CurrSamplesPerSec=84.52953466717427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:35,796] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=54, lr=[8.769253684784129e-06, 8.769253684784129e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:35,830] [INFO] [timer.py:215:stop] epoch=3/micro_step=170/global_step=2930, RunningAvgSamplesPerSec=85.06956453873094, CurrSamplesPerSec=85.04336068258876, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:43,339] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=54, lr=[8.763313421277957e-06, 8.763313421277957e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:43,369] [INFO] [timer.py:215:stop] epoch=3/micro_step=180/global_step=2940, RunningAvgSamplesPerSec=85.06920385846882, CurrSamplesPerSec=84.78391516664124, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:50,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=54, lr=[8.757355218936757e-06, 8.757355218936757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:50,899] [INFO] [timer.py:215:stop] epoch=3/micro_step=190/global_step=2950, RunningAvgSamplesPerSec=85.06918975715519, CurrSamplesPerSec=85.29195800529604, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:38:56,116] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:38:56,814] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:38:58,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=56, lr=[8.752575759337464e-06, 8.752575759337464e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:38:58,320] [INFO] [timer.py:215:stop] epoch=3/micro_step=200/global_step=2960, RunningAvgSamplesPerSec=85.07332411995242, CurrSamplesPerSec=84.91329201474461, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:05,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=56, lr=[8.746585335539165e-06, 8.746585335539165e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:05,858] [INFO] [timer.py:215:stop] epoch=3/micro_step=210/global_step=2970, RunningAvgSamplesPerSec=85.07296752183284, CurrSamplesPerSec=84.77756912533359, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:13,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=56, lr=[8.740577049101491e-06, 8.740577049101491e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:13,379] [INFO] [timer.py:215:stop] epoch=3/micro_step=220/global_step=2980, RunningAvgSamplesPerSec=85.07324895219385, CurrSamplesPerSec=85.18482920222442, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:20,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=56, lr=[8.73455092739191e-06, 8.73455092739191e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:20,918] [INFO] [timer.py:215:stop] epoch=3/micro_step=230/global_step=2990, RunningAvgSamplesPerSec=85.07287648648418, CurrSamplesPerSec=84.6080244082712, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:28,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=56, lr=[8.728506997859123e-06, 
8.728506997859123e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:28,451] [INFO] [timer.py:215:stop] epoch=3/micro_step=240/global_step=3000, RunningAvgSamplesPerSec=85.07272184786332, CurrSamplesPerSec=84.78943190687535, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:35,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=56, lr=[8.72244528803295e-06, 8.72244528803295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:35,994] [INFO] [timer.py:215:stop] epoch=3/micro_step=250/global_step=3010, RunningAvgSamplesPerSec=85.07220359111453, CurrSamplesPerSec=85.29477654385714, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:43,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=56, lr=[8.7163658255242e-06, 8.7163658255242e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:43,518] [INFO] [timer.py:215:stop] epoch=3/micro_step=260/global_step=3020, RunningAvgSamplesPerSec=85.07239325049451, CurrSamplesPerSec=85.06443508704326, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:51,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=56, lr=[8.710268638024543e-06, 8.710268638024543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:51,053] [INFO] [timer.py:215:stop] epoch=3/micro_step=270/global_step=3030, RunningAvgSamplesPerSec=85.07219298387562, CurrSamplesPerSec=84.6124247840052, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:39:58,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=56, lr=[8.704153753306384e-06, 8.704153753306384e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:39:58,576] [INFO] [timer.py:215:stop] epoch=3/micro_step=280/global_step=3040, RunningAvgSamplesPerSec=85.07240881661208, CurrSamplesPerSec=84.99583341117257, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:06,076] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=56, lr=[8.698021199222738e-06, 8.698021199222738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-06-29 17:40:06,110] [INFO] [timer.py:215:stop] epoch=3/micro_step=290/global_step=3050, RunningAvgSamplesPerSec=85.07221010483329, CurrSamplesPerSec=85.22212320462704, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:12,843] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:40:13,544] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:40:13,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=58, lr=[8.693102452781284e-06, 8.693102452781284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:13,545] [INFO] [timer.py:215:stop] epoch=3/micro_step=300/global_step=3060, RunningAvgSamplesPerSec=85.07569760225213, CurrSamplesPerSec=91.45096061927698, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:21,055] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=58, lr=[8.68693816428619e-06, 8.68693816428619e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:21,088] [INFO] [timer.py:215:stop] epoch=3/micro_step=310/global_step=3070, RunningAvgSamplesPerSec=85.07514966301117, CurrSamplesPerSec=85.04559698463837, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:28,603] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=58, lr=[8.680756284841818e-06, 8.680756284841818e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:28,637] [INFO] [timer.py:215:stop] epoch=3/micro_step=320/global_step=3080, RunningAvgSamplesPerSec=85.07442251946226, CurrSamplesPerSec=85.20197119385713, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:36,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=58, lr=[8.674556842606344e-06, 8.674556842606344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:36,189] [INFO] [timer.py:215:stop] epoch=3/micro_step=330/global_step=3090, 
RunningAvgSamplesPerSec=85.073558626454, CurrSamplesPerSec=84.67912648172842, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:43,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=58, lr=[8.668339865817942e-06, 8.668339865817942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:43,734] [INFO] [timer.py:215:stop] epoch=3/micro_step=340/global_step=3100, RunningAvgSamplesPerSec=85.0729478015435, CurrSamplesPerSec=85.06384205819703, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:51,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=58, lr=[8.662105382794651e-06, 8.662105382794651e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:51,272] [INFO] [timer.py:215:stop] epoch=3/micro_step=350/global_step=3110, RunningAvgSamplesPerSec=85.07261227102157, CurrSamplesPerSec=85.01106866397289, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:40:58,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=58, lr=[8.655853421934254e-06, 8.655853421934254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:40:58,806] [INFO] [timer.py:215:stop] epoch=3/micro_step=360/global_step=3120, RunningAvgSamplesPerSec=85.07243716838269, CurrSamplesPerSec=84.67314332758721, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:06,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=58, lr=[8.649584011714141e-06, 8.649584011714141e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:06,344] [INFO] [timer.py:215:stop] epoch=3/micro_step=370/global_step=3130, RunningAvgSamplesPerSec=85.0720729596735, CurrSamplesPerSec=84.60933113874034, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:13,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=58, lr=[8.643297180691187e-06, 8.643297180691187e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:13,887] [INFO] [timer.py:215:stop] epoch=3/micro_step=380/global_step=3140, RunningAvgSamplesPerSec=85.07157386660562, 
CurrSamplesPerSec=84.87226924605503, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:21,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=58, lr=[8.636992957501612e-06, 8.636992957501612e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:21,419] [INFO] [timer.py:215:stop] epoch=3/micro_step=390/global_step=3150, RunningAvgSamplesPerSec=85.07146181011836, CurrSamplesPerSec=85.08236463500715, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:28,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=58, lr=[8.630671370860863e-06, 8.630671370860863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:28,961] [INFO] [timer.py:215:stop] epoch=3/micro_step=400/global_step=3160, RunningAvgSamplesPerSec=85.07099997093934, CurrSamplesPerSec=84.68465630899381, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:29,659] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:41:30,358] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:41:36,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=60, lr=[8.625601619210692e-06, 8.625601619210692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:36,386] [INFO] [timer.py:215:stop] epoch=3/micro_step=410/global_step=3170, RunningAvgSamplesPerSec=85.07472667441102, CurrSamplesPerSec=85.2534115122002, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:43,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=60, lr=[8.61924885097312e-06, 8.61924885097312e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:43,925] [INFO] [timer.py:215:stop] epoch=3/micro_step=420/global_step=3180, RunningAvgSamplesPerSec=85.07435376920583, CurrSamplesPerSec=84.96352368472694, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:51,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=60, lr=[8.612878800107956e-06, 8.612878800107956e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:51,460] [INFO] [timer.py:215:stop] epoch=3/micro_step=430/global_step=3190, RunningAvgSamplesPerSec=85.07411661343669, CurrSamplesPerSec=84.89167494704458, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:41:58,964] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=60, lr=[8.606491495630485e-06, 8.606491495630485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:41:58,997] [INFO] [timer.py:215:stop] epoch=3/micro_step=440/global_step=3200, RunningAvgSamplesPerSec=85.07383168881435, CurrSamplesPerSec=84.93029801408315, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:06,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=60, lr=[8.600086966634588e-06, 8.600086966634588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:06,528] [INFO] [timer.py:215:stop] epoch=3/micro_step=450/global_step=3210, RunningAvgSamplesPerSec=85.07374130979078, CurrSamplesPerSec=84.82096608557444, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:42:14,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=60, lr=[8.593665242292592e-06, 8.593665242292592e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:14,064] [INFO] [timer.py:215:stop] epoch=3/micro_step=460/global_step=3220, RunningAvgSamplesPerSec=85.07350667960107, CurrSamplesPerSec=85.07902080769388, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:21,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=60, lr=[8.587226351855153e-06, 8.587226351855153e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:21,601] [INFO] [timer.py:215:stop] epoch=3/micro_step=470/global_step=3230, RunningAvgSamplesPerSec=85.07320466373362, CurrSamplesPerSec=84.94241860891935, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:29,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=60, lr=[8.580770324651124e-06, 8.580770324651124e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:29,151] [INFO] [timer.py:215:stop] epoch=3/micro_step=480/global_step=3240, RunningAvgSamplesPerSec=85.07247338481805, CurrSamplesPerSec=84.71242321583824, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:36,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=60, lr=[8.574297190087406e-06, 8.574297190087406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:36,693] [INFO] [timer.py:215:stop] epoch=3/micro_step=490/global_step=3250, RunningAvgSamplesPerSec=85.07200883297729, CurrSamplesPerSec=84.91484994294944, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:44,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=60, lr=[8.567806977648827e-06, 8.567806977648827e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:44,236] [INFO] [timer.py:215:stop] epoch=3/micro_step=500/global_step=3260, RunningAvgSamplesPerSec=85.0715363636349, CurrSamplesPerSec=85.01246864312496, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:46,435] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:42:47,134] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:42:51,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=62, lr=[8.562602531491531e-06, 8.562602531491531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:51,656] [INFO] [timer.py:215:stop] epoch=3/micro_step=510/global_step=3270, RunningAvgSamplesPerSec=85.07533385195082, CurrSamplesPerSec=85.02752133167694, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:42:59,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=62, lr=[8.556081653428184e-06, 8.556081653428184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:42:59,187] [INFO] [timer.py:215:stop] epoch=3/micro_step=520/global_step=3280, RunningAvgSamplesPerSec=85.07525286739522, CurrSamplesPerSec=85.20034862780173, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:06,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=62, lr=[8.549543780460902e-06, 8.549543780460902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:06,728] [INFO] [timer.py:215:stop] epoch=3/micro_step=530/global_step=3290, RunningAvgSamplesPerSec=85.07482501711287, CurrSamplesPerSec=85.02081562960146, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:14,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=62, lr=[8.542988942369392e-06, 8.542988942369392e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:14,257] [INFO] [timer.py:215:stop] epoch=3/micro_step=540/global_step=3300, RunningAvgSamplesPerSec=85.07478548126292, CurrSamplesPerSec=85.23936141161109, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:21,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=62, lr=[8.536417169010639e-06, 
8.536417169010639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:21,792] [INFO] [timer.py:215:stop] epoch=3/micro_step=550/global_step=3310, RunningAvgSamplesPerSec=85.07459193239528, CurrSamplesPerSec=85.08654478878883, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:29,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=62, lr=[8.529828490318763e-06, 8.529828490318763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:29,336] [INFO] [timer.py:215:stop] epoch=3/micro_step=560/global_step=3320, RunningAvgSamplesPerSec=85.07406497695192, CurrSamplesPerSec=85.00043571218677, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:36,833] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=62, lr=[8.523222936304894e-06, 8.523222936304894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:36,867] [INFO] [timer.py:215:stop] epoch=3/micro_step=570/global_step=3330, RunningAvgSamplesPerSec=85.07399039332036, CurrSamplesPerSec=85.05192932116994, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:44,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=62, lr=[8.516600537057021e-06, 8.516600537057021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:44,405] [INFO] [timer.py:215:stop] epoch=3/micro_step=580/global_step=3340, RunningAvgSamplesPerSec=85.07369833148479, CurrSamplesPerSec=84.99914378790258, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:51,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=62, lr=[8.509961322739866e-06, 8.509961322739866e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:43:51,938] [INFO] [timer.py:215:stop] epoch=3/micro_step=590/global_step=3350, RunningAvgSamplesPerSec=85.0735272343823, CurrSamplesPerSec=85.10197446009866, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:43:59,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=62, lr=[8.503305323594745e-06, 8.503305323594745e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:43:59,473] [INFO] [timer.py:215:stop] epoch=3/micro_step=600/global_step=3360, RunningAvgSamplesPerSec=85.07331891444343, CurrSamplesPerSec=85.32782696123978, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:03,184] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:44:03,883] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:44:06,865] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=64, lr=[8.497968459573483e-06, 8.497968459573483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:06,899] [INFO] [timer.py:215:stop] epoch=3/micro_step=610/global_step=3370, RunningAvgSamplesPerSec=85.07679006493419, CurrSamplesPerSec=85.16866689299165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:14,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=64, lr=[8.491282324190084e-06, 8.491282324190084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:14,433] [INFO] [timer.py:215:stop] epoch=3/micro_step=620/global_step=3380, RunningAvgSamplesPerSec=85.0765998837195, CurrSamplesPerSec=84.90980031435251, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:21,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=64, lr=[8.484579489060685e-06, 8.484579489060685e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:21,967] [INFO] [timer.py:215:stop] epoch=3/micro_step=630/global_step=3390, RunningAvgSamplesPerSec=85.07641912681228, CurrSamplesPerSec=85.1125248859424, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:29,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=64, lr=[8.477859984716394e-06, 8.477859984716394e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:29,491] [INFO] [timer.py:215:stop] 
epoch=3/micro_step=640/global_step=3400, RunningAvgSamplesPerSec=85.07655434877394, CurrSamplesPerSec=85.08821696531307, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:36,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=64, lr=[8.471123841764245e-06, 8.471123841764245e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:37,029] [INFO] [timer.py:215:stop] epoch=3/micro_step=650/global_step=3410, RunningAvgSamplesPerSec=85.07625560238536, CurrSamplesPerSec=84.84214492057372, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:44,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=64, lr=[8.464371090887049e-06, 8.464371090887049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:44,557] [INFO] [timer.py:215:stop] epoch=3/micro_step=660/global_step=3420, RunningAvgSamplesPerSec=85.07625191443762, CurrSamplesPerSec=84.8718935662386, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:52,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=64, lr=[8.45760176284328e-06, 8.45760176284328e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:52,091] [INFO] [timer.py:215:stop] epoch=3/micro_step=670/global_step=3430, RunningAvgSamplesPerSec=85.07606147011127, CurrSamplesPerSec=85.10213633945561, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:44:59,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=64, lr=[8.450815888466909e-06, 8.450815888466909e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:44:59,620] [INFO] [timer.py:215:stop] epoch=3/micro_step=680/global_step=3440, RunningAvgSamplesPerSec=85.0760517803807, CurrSamplesPerSec=85.30464289595797, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:07,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=64, lr=[8.444013498667281e-06, 8.444013498667281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:07,146] [INFO] [timer.py:215:stop] epoch=3/micro_step=690/global_step=3450, 
RunningAvgSamplesPerSec=85.0761267438312, CurrSamplesPerSec=85.2484569018629, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:14,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=64, lr=[8.437194624428967e-06, 8.437194624428967e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:14,683] [INFO] [timer.py:215:stop] epoch=3/micro_step=700/global_step=3460, RunningAvgSamplesPerSec=85.0758550149198, CurrSamplesPerSec=84.84396840327295, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:19,900] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:45:20,596] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:45:22,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=66, lr=[8.43172767711203e-06, 8.43172767711203e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:22,103] [INFO] [timer.py:215:stop] epoch=3/micro_step=710/global_step=3470, RunningAvgSamplesPerSec=85.07937807595356, CurrSamplesPerSec=85.13360658753204, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:29,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=66, lr=[8.42487920920478e-06, 8.42487920920478e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:29,627] [INFO] [timer.py:215:stop] epoch=3/micro_step=720/global_step=3480, RunningAvgSamplesPerSec=85.07950091350452, CurrSamplesPerSec=85.05403132642049, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:37,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=66, lr=[8.418014344014644e-06, 8.418014344014644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:37,154] [INFO] [timer.py:215:stop] epoch=3/micro_step=730/global_step=3490, RunningAvgSamplesPerSec=85.07954999706085, CurrSamplesPerSec=84.8909769395216, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 17:45:44,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=66, lr=[8.411133112810762e-06, 8.411133112810762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:44,668] [INFO] [timer.py:215:stop] epoch=3/micro_step=740/global_step=3500, RunningAvgSamplesPerSec=85.08001783359062, CurrSamplesPerSec=85.32568427945239, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:52,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=66, lr=[8.404235546936829e-06, 8.404235546936829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:52,198] [INFO] [timer.py:215:stop] epoch=3/micro_step=750/global_step=3510, RunningAvgSamplesPerSec=85.07996923085142, CurrSamplesPerSec=85.02550143057732, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:45:59,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=66, lr=[8.397321677810934e-06, 8.397321677810934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:45:59,737] [INFO] [timer.py:215:stop] epoch=3/micro_step=760/global_step=3520, RunningAvgSamplesPerSec=85.07961064491474, CurrSamplesPerSec=85.31082408142484, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:07,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=66, lr=[8.390391536925431e-06, 8.390391536925431e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:07,282] [INFO] [timer.py:215:stop] epoch=3/micro_step=770/global_step=3530, RunningAvgSamplesPerSec=85.07906754602602, CurrSamplesPerSec=84.99144688641209, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:14,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=66, lr=[8.38344515584679e-06, 8.38344515584679e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:14,811] [INFO] [timer.py:215:stop] epoch=3/micro_step=780/global_step=3540, RunningAvgSamplesPerSec=85.07905434283253, CurrSamplesPerSec=85.18872204598568, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:22,316] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=66, lr=[8.376482566215455e-06, 8.376482566215455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:22,349] [INFO] [timer.py:215:stop] epoch=3/micro_step=790/global_step=3550, RunningAvgSamplesPerSec=85.07874173154329, CurrSamplesPerSec=84.93841386123201, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:29,842] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=66, lr=[8.3695037997457e-06, 8.3695037997457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:29,877] [INFO] [timer.py:215:stop] epoch=3/micro_step=800/global_step=3560, RunningAvgSamplesPerSec=85.07876494136427, CurrSamplesPerSec=85.00213142241017, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:36,588] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:46:37,286] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:46:37,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=68, lr=[8.363909160605268e-06, 8.363909160605268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:37,287] [INFO] [timer.py:215:stop] epoch=3/micro_step=810/global_step=3570, RunningAvgSamplesPerSec=85.08248892108445, CurrSamplesPerSec=91.79841487402982, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:44,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=68, lr=[8.356901355981433e-06, 8.356901355981433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:44,822] [INFO] [timer.py:215:stop] epoch=3/micro_step=820/global_step=3580, RunningAvgSamplesPerSec=85.08227067878825, CurrSamplesPerSec=85.04422285739993, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:52,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=68, lr=[8.349877463710679e-06, 8.349877463710679e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:52,350] [INFO] [timer.py:215:stop] epoch=3/micro_step=830/global_step=3590, RunningAvgSamplesPerSec=85.08227856185856, CurrSamplesPerSec=85.16520819889223, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:46:59,842] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=68, lr=[8.342837515786516e-06, 8.342837515786516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:46:59,876] [INFO] [timer.py:215:stop] epoch=3/micro_step=840/global_step=3600, RunningAvgSamplesPerSec=85.08232309712263, CurrSamplesPerSec=85.20443220365614, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:07,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=68, lr=[8.335781544275574e-06, 8.335781544275574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:07,419] [INFO] [timer.py:215:stop] epoch=3/micro_step=850/global_step=3610, RunningAvgSamplesPerSec=85.08186461225158, CurrSamplesPerSec=85.02205435435008, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:47:14,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=68, lr=[8.32870958131748e-06, 8.32870958131748e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:14,975] [INFO] [timer.py:215:stop] epoch=3/micro_step=860/global_step=3620, RunningAvgSamplesPerSec=85.08099089204764, CurrSamplesPerSec=84.69798963501692, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:22,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=68, lr=[8.321621659124696e-06, 8.321621659124696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:22,515] [INFO] [timer.py:215:stop] epoch=3/micro_step=870/global_step=3630, RunningAvgSamplesPerSec=85.08060886842384, CurrSamplesPerSec=84.85166543652217, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:30,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=68, lr=[8.31451780998238e-06, 8.31451780998238e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:30,060] [INFO] [timer.py:215:stop] epoch=3/micro_step=880/global_step=3640, RunningAvgSamplesPerSec=85.08007945678526, CurrSamplesPerSec=84.76156097289883, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:37,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=68, lr=[8.307398066248235e-06, 8.307398066248235e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:37,604] [INFO] [timer.py:215:stop] epoch=3/micro_step=890/global_step=3650, RunningAvgSamplesPerSec=85.07958110988812, CurrSamplesPerSec=84.47670415690702, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:45,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=68, lr=[8.300262460352361e-06, 8.300262460352361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:45,138] [INFO] [timer.py:215:stop] epoch=3/micro_step=900/global_step=3660, RunningAvgSamplesPerSec=85.07941914099217, CurrSamplesPerSec=84.9626631453809, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:52,623] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=3670, skipped=68, lr=[8.293111024797115e-06, 8.293111024797115e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:47:52,657] [INFO] [timer.py:215:stop] epoch=3/micro_step=910/global_step=3670, RunningAvgSamplesPerSec=85.07971309640944, CurrSamplesPerSec=85.39884039620591, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:47:53,352] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:47:54,050] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:48:00,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=70, lr=[8.287378500885789e-06, 8.287378500885789e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:00,074] [INFO] [timer.py:215:stop] epoch=3/micro_step=920/global_step=3680, RunningAvgSamplesPerSec=85.08313509749061, CurrSamplesPerSec=84.8448801740165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 4/16 *****
ppl: 1.8734874725341797
Beginning of Epoch 5/16, Total Micro Batches 920
[2023-06-29 17:48:25,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=3690, skipped=70, lr=[8.280198654079664e-06, 8.280198654079664e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:25,518] [INFO] [timer.py:215:stop] epoch=4/micro_step=10/global_step=3690, RunningAvgSamplesPerSec=85.08216266509393, CurrSamplesPerSec=84.28289440220739, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:48:33,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=3700, skipped=70, lr=[8.273003069003873e-06, 8.273003069003873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:33,058] [INFO] [timer.py:215:stop] epoch=4/micro_step=20/global_step=3700, RunningAvgSamplesPerSec=85.08180867100728, CurrSamplesPerSec=85.15821060676093, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:48:40,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=3710, skipped=70, lr=[8.265791778433975e-06, 8.265791778433975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:40,600] [INFO] [timer.py:215:stop] epoch=4/micro_step=30/global_step=3710, RunningAvgSamplesPerSec=85.0813998337943, CurrSamplesPerSec=84.78603072487535, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:48:48,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=3720, skipped=70, lr=[8.258564815217059e-06, 8.258564815217059e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:48,140] [INFO] [timer.py:215:stop] epoch=4/micro_step=40/global_step=3720, RunningAvgSamplesPerSec=85.08105482823527, CurrSamplesPerSec=84.91737498248257, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:48:55,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=3730, skipped=70, lr=[8.251322212271614e-06, 8.251322212271614e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:48:55,675] [INFO] [timer.py:215:stop] epoch=4/micro_step=50/global_step=3730, RunningAvgSamplesPerSec=85.08082360714042, CurrSamplesPerSec=85.17807163786205, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:03,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=3740, skipped=70, lr=[8.244064002587355e-06, 8.244064002587355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:03,211] [INFO] [timer.py:215:stop] epoch=4/micro_step=60/global_step=3740, RunningAvgSamplesPerSec=85.08059262888517, CurrSamplesPerSec=85.10755965381905, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:10,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=3750, skipped=70, lr=[8.236790219225093e-06, 8.236790219225093e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:10,745] [INFO] [timer.py:215:stop] epoch=4/micro_step=70/global_step=3750, RunningAvgSamplesPerSec=85.08042347913121, CurrSamplesPerSec=85.15518497710725, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:18,244] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=3760, skipped=70, lr=[8.229500895316573e-06, 8.229500895316573e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:18,277] [INFO] [timer.py:215:stop] epoch=4/micro_step=80/global_step=3760, RunningAvgSamplesPerSec=85.08030592621456, CurrSamplesPerSec=84.90254923871386, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:25,796] [INFO] [logging.py:96:log_dist] [Rank 0] step=3770, skipped=70, lr=[8.222196064064329e-06, 8.222196064064329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:25,830] [INFO] [timer.py:215:stop] epoch=4/micro_step=90/global_step=3770, RunningAvgSamplesPerSec=85.07960305582749, CurrSamplesPerSec=84.81062179079693, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:28,032] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:49:28,729] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:49:33,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=3780, skipped=72, lr=[8.216341056132252e-06, 8.216341056132252e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:33,257] [INFO] [timer.py:215:stop] epoch=4/micro_step=100/global_step=3780, RunningAvgSamplesPerSec=85.08265260479791, CurrSamplesPerSec=84.3970248633367, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:40,765] [INFO] [logging.py:96:log_dist] [Rank 0] step=3790, skipped=72, lr=[8.209008395557055e-06, 8.209008395557055e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:40,798] [INFO] [timer.py:215:stop] epoch=4/micro_step=110/global_step=3790, RunningAvgSamplesPerSec=85.08226789274046, CurrSamplesPerSec=85.20943579377469, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:48,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=3800, skipped=72, lr=[8.20166032098052e-06, 8.20166032098052e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:48,326] [INFO] [timer.py:215:stop] epoch=4/micro_step=120/global_step=3800, RunningAvgSamplesPerSec=85.08229115278591, CurrSamplesPerSec=85.00503851163297, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:49:55,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=3810, skipped=72, lr=[8.194296865872786e-06, 8.194296865872786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:49:55,853] [INFO] [timer.py:215:stop] epoch=4/micro_step=130/global_step=3810, RunningAvgSamplesPerSec=85.08233980630209, CurrSamplesPerSec=85.01365327688463, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:03,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=3820, skipped=72, lr=[8.186918063774048e-06, 8.186918063774048e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:03,392] [INFO] [timer.py:215:stop] epoch=4/micro_step=140/global_step=3820, RunningAvgSamplesPerSec=85.08204226338393, CurrSamplesPerSec=85.18912757100563, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:50:10,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=3830, skipped=72, lr=[8.179523948294408e-06, 8.179523948294408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:10,915] [INFO] [timer.py:215:stop] epoch=4/micro_step=150/global_step=3830, RunningAvgSamplesPerSec=85.08217838558126, CurrSamplesPerSec=85.22834655669301, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:18,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=3840, skipped=72, lr=[8.172114553113722e-06, 8.172114553113722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:18,442] [INFO] [timer.py:215:stop] epoch=4/micro_step=160/global_step=3840, RunningAvgSamplesPerSec=85.08224053451583, CurrSamplesPerSec=84.85032438950763, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:25,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=3850, skipped=72, lr=[8.164689911981435e-06, 8.164689911981435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:25,980] [INFO] [timer.py:215:stop] epoch=4/micro_step=170/global_step=3850, RunningAvgSamplesPerSec=85.08195176513561, CurrSamplesPerSec=85.16899115996647, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:33,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=3860, skipped=72, lr=[8.15725005871645e-06, 8.15725005871645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:33,516] [INFO] [timer.py:215:stop] epoch=4/micro_step=180/global_step=3860, RunningAvgSamplesPerSec=85.08172276937773, CurrSamplesPerSec=85.11719381199829, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:41,020] [INFO] [logging.py:96:log_dist] [Rank 0] step=3870, skipped=72, lr=[8.14979502720695e-06, 8.14979502720695e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:41,053] [INFO] [timer.py:215:stop] epoch=4/micro_step=190/global_step=3870, RunningAvgSamplesPerSec=85.08149156363719, CurrSamplesPerSec=85.03805330396042, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:44,763] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:50:45,460] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:50:48,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=3880, skipped=74, lr=[8.143820096480303e-06, 8.143820096480303e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:48,478] [INFO] [timer.py:215:stop] epoch=4/micro_step=200/global_step=3880, RunningAvgSamplesPerSec=85.08451133805168, CurrSamplesPerSec=84.85568883191847, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:50:55,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=3890, skipped=74, lr=[8.13633782974949e-06, 8.13633782974949e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:50:56,023] [INFO] [timer.py:215:stop] epoch=4/micro_step=210/global_step=3890, RunningAvgSamplesPerSec=85.0840218173507, CurrSamplesPerSec=85.05683416171465, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:03,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=3900, skipped=74, lr=[8.1288404800284e-06, 8.1288404800284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:03,549] [INFO] [timer.py:215:stop] epoch=4/micro_step=220/global_step=3900, RunningAvgSamplesPerSec=85.0841005953452, CurrSamplesPerSec=85.43085677040817, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:11,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=3910, skipped=74, lr=[8.121328081467107e-06, 8.121328081467107e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:11,080] [INFO] [timer.py:215:stop] epoch=4/micro_step=230/global_step=3910, RunningAvgSamplesPerSec=85.0840220174607, CurrSamplesPerSec=85.03837657711256, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:18,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=3920, skipped=74, lr=[8.11380066828424e-06, 
8.11380066828424e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:18,621] [INFO] [timer.py:215:stop] epoch=4/micro_step=240/global_step=3920, RunningAvgSamplesPerSec=85.08364786390383, CurrSamplesPerSec=84.9749812756846, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:26,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=3930, skipped=74, lr=[8.106258274766821e-06, 8.106258274766821e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:26,154] [INFO] [timer.py:215:stop] epoch=4/micro_step=250/global_step=3930, RunningAvgSamplesPerSec=85.08352328821867, CurrSamplesPerSec=84.99392266068031, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:33,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=3940, skipped=74, lr=[8.098700935270097e-06, 8.098700935270097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:33,689] [INFO] [timer.py:215:stop] epoch=4/micro_step=260/global_step=3940, RunningAvgSamplesPerSec=85.08331842091785, CurrSamplesPerSec=84.53483200134029, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:41,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=3950, skipped=74, lr=[8.091128684217402e-06, 8.091128684217402e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:41,225] [INFO] [timer.py:215:stop] epoch=4/micro_step=270/global_step=3950, RunningAvgSamplesPerSec=85.08310336946883, CurrSamplesPerSec=84.92307030378765, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:48,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=3960, skipped=74, lr=[8.083541556099988e-06, 8.083541556099988e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:51:48,767] [INFO] [timer.py:215:stop] epoch=4/micro_step=280/global_step=3960, RunningAvgSamplesPerSec=85.08271851729353, CurrSamplesPerSec=85.01766512278427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:51:56,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=3970, skipped=74, lr=[8.075939585476871e-06, 8.075939585476871e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 17:51:56,298] [INFO] [timer.py:215:stop] epoch=4/micro_step=290/global_step=3970, RunningAvgSamplesPerSec=85.08261148810115, CurrSamplesPerSec=85.11247091291361, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:01,519] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:52:02,218] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:52:03,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=3980, skipped=76, lr=[8.069847345641095e-06, 8.069847345641095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:03,721] [INFO] [timer.py:215:stop] epoch=4/micro_step=300/global_step=3980, RunningAvgSamplesPerSec=85.08561545734885, CurrSamplesPerSec=85.25609211523424, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:11,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=3990, skipped=76, lr=[8.062218745812137e-06, 8.062218745812137e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:11,253] [INFO] [timer.py:215:stop] epoch=4/micro_step=310/global_step=3990, RunningAvgSamplesPerSec=85.08551051176904, CurrSamplesPerSec=84.82892701472902, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:18,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=76, lr=[8.054575400601889e-06, 8.054575400601889e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:18,793] [INFO] [timer.py:215:stop] epoch=4/micro_step=320/global_step=4000, RunningAvgSamplesPerSec=85.08515657520636, CurrSamplesPerSec=85.09660581034142, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:26,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=4010, skipped=76, lr=[8.046917344825433e-06, 8.046917344825433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:26,333] [INFO] [timer.py:215:stop] 
epoch=4/micro_step=330/global_step=4010, RunningAvgSamplesPerSec=85.08481874002351, CurrSamplesPerSec=84.97643386705667, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:33,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=4020, skipped=76, lr=[8.03924461336486e-06, 8.03924461336486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:33,872] [INFO] [timer.py:215:stop] epoch=4/micro_step=340/global_step=4020, RunningAvgSamplesPerSec=85.08450396007458, CurrSamplesPerSec=84.97194175442617, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:41,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=4030, skipped=76, lr=[8.031557241169105e-06, 8.031557241169105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:41,421] [INFO] [timer.py:215:stop] epoch=4/micro_step=350/global_step=4030, RunningAvgSamplesPerSec=85.08391884446189, CurrSamplesPerSec=84.75698450231899, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:48,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=4040, skipped=76, lr=[8.023855263253791e-06, 8.023855263253791e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:48,947] [INFO] [timer.py:215:stop] epoch=4/micro_step=360/global_step=4040, RunningAvgSamplesPerSec=85.08397632552777, CurrSamplesPerSec=84.95707006450061, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:52:56,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=4050, skipped=76, lr=[8.016138714701073e-06, 8.016138714701073e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:52:56,482] [INFO] [timer.py:215:stop] epoch=4/micro_step=370/global_step=4050, RunningAvgSamplesPerSec=85.08376494477064, CurrSamplesPerSec=84.90117973014858, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:03,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=4060, skipped=76, lr=[8.008407630659467e-06, 8.008407630659467e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:04,017] [INFO] [timer.py:215:stop] epoch=4/micro_step=380/global_step=4060, 
RunningAvgSamplesPerSec=85.0835923941987, CurrSamplesPerSec=85.39772650632, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:11,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=4070, skipped=76, lr=[8.000662046343707e-06, 8.000662046343707e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:11,548] [INFO] [timer.py:215:stop] epoch=4/micro_step=390/global_step=4070, RunningAvgSamplesPerSec=85.08351752725754, CurrSamplesPerSec=84.98049600653921, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:18,268] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:53:18,966] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:53:18,966] [INFO] [logging.py:96:log_dist] [Rank 0] step=4080, skipped=78, lr=[7.994455162400175e-06, 7.994455162400175e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:18,967] [INFO] [timer.py:215:stop] epoch=4/micro_step=400/global_step=4080, RunningAvgSamplesPerSec=85.08655129652304, CurrSamplesPerSec=91.82196553778878, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:26,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=4090, skipped=78, lr=[7.986683566542777e-06, 7.986683566542777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:26,499] [INFO] [timer.py:215:stop] epoch=4/micro_step=410/global_step=4090, RunningAvgSamplesPerSec=85.08643891428555, CurrSamplesPerSec=84.73905806226362, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:34,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=4100, skipped=78, lr=[7.978897569363325e-06, 7.978897569363325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:34,035] [INFO] [timer.py:215:stop] epoch=4/micro_step=420/global_step=4100, RunningAvgSamplesPerSec=85.08622871771874, CurrSamplesPerSec=85.00471549189982, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 17:53:41,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=4110, skipped=78, lr=[7.971097206326683e-06, 7.971097206326683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:41,554] [INFO] [timer.py:215:stop] epoch=4/micro_step=430/global_step=4110, RunningAvgSamplesPerSec=85.08644583084639, CurrSamplesPerSec=85.02006164090483, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:49,062] [INFO] [logging.py:96:log_dist] [Rank 0] step=4120, skipped=78, lr=[7.963282512963134e-06, 7.963282512963134e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:49,096] [INFO] [timer.py:215:stop] epoch=4/micro_step=440/global_step=4120, RunningAvgSamplesPerSec=85.08607533682968, CurrSamplesPerSec=84.85394531373478, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:53:56,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=78, lr=[7.95545352486825e-06, 7.95545352486825e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:53:56,627] [INFO] [timer.py:215:stop] epoch=4/micro_step=450/global_step=4130, RunningAvgSamplesPerSec=85.08599169710303, CurrSamplesPerSec=85.06621422319516, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:04,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=4140, skipped=78, lr=[7.947610277702705e-06, 7.947610277702705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:04,173] [INFO] [timer.py:215:stop] epoch=4/micro_step=460/global_step=4140, RunningAvgSamplesPerSec=85.08549205052097, CurrSamplesPerSec=84.82085887787093, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:11,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=4150, skipped=78, lr=[7.939752807192133e-06, 7.939752807192133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:11,699] [INFO] [timer.py:215:stop] epoch=4/micro_step=470/global_step=4150, RunningAvgSamplesPerSec=85.08554064058099, CurrSamplesPerSec=84.98582310910149, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:19,195] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=4160, skipped=78, lr=[7.931881149126938e-06, 7.931881149126938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:19,229] [INFO] [timer.py:215:stop] epoch=4/micro_step=480/global_step=4160, RunningAvgSamplesPerSec=85.08549987074689, CurrSamplesPerSec=84.94335937623609, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:26,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=4170, skipped=78, lr=[7.923995339362163e-06, 7.923995339362163e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:26,760] [INFO] [timer.py:215:stop] epoch=4/micro_step=490/global_step=4170, RunningAvgSamplesPerSec=85.08540475611348, CurrSamplesPerSec=84.86204657438454, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:34,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=4180, skipped=78, lr=[7.91609541381731e-06, 7.91609541381731e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:34,289] [INFO] [timer.py:215:stop] epoch=4/micro_step=500/global_step=4180, RunningAvgSamplesPerSec=85.08540181071562, CurrSamplesPerSec=85.04697115628326, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:34,985] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:54:35,680] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:54:41,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=4190, skipped=80, lr=[7.909765334198717e-06, 7.909765334198717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:41,700] [INFO] [timer.py:215:stop] epoch=4/micro_step=510/global_step=4190, RunningAvgSamplesPerSec=85.08855972788632, CurrSamplesPerSec=84.93629069181343, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:49,195] [INFO] [logging.py:96:log_dist] [Rank 0] step=4200, skipped=80, lr=[7.901840090971978e-06, 7.901840090971978e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:49,228] [INFO] [timer.py:215:stop] epoch=4/micro_step=520/global_step=4200, RunningAvgSamplesPerSec=85.08856082598513, CurrSamplesPerSec=85.16815347533131, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:54:56,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=4210, skipped=80, lr=[7.893900832881286e-06, 7.893900832881286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:54:56,765] [INFO] [timer.py:215:stop] epoch=4/micro_step=530/global_step=4210, RunningAvgSamplesPerSec=85.08829963495012, CurrSamplesPerSec=85.12159332663823, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:04,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=4220, skipped=80, lr=[7.88594759608959e-06, 7.88594759608959e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:04,301] [INFO] [timer.py:215:stop] epoch=4/micro_step=540/global_step=4220, RunningAvgSamplesPerSec=85.0880927691816, CurrSamplesPerSec=85.42030882809206, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:11,796] [INFO] [logging.py:96:log_dist] [Rank 0] step=4230, skipped=80, lr=[7.87798041682352e-06, 7.87798041682352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:11,830] [INFO] [timer.py:215:stop] epoch=4/micro_step=550/global_step=4230, RunningAvgSamplesPerSec=85.08806812046393, CurrSamplesPerSec=85.25744601921915, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 17:55:19,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=4240, skipped=80, lr=[7.869999331373206e-06, 7.869999331373206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:19,359] [INFO] [timer.py:215:stop] epoch=4/micro_step=560/global_step=4240, RunningAvgSamplesPerSec=85.08801737440828, CurrSamplesPerSec=85.03964275395744, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:26,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=4250, skipped=80, lr=[7.862004376092122e-06, 7.862004376092122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:26,893] [INFO] [timer.py:215:stop] epoch=4/micro_step=570/global_step=4250, RunningAvgSamplesPerSec=85.0878749388658, CurrSamplesPerSec=84.87377199857973, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:34,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=4260, skipped=80, lr=[7.853995587396918e-06, 7.853995587396918e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:34,415] [INFO] [timer.py:215:stop] epoch=4/micro_step=580/global_step=4260, RunningAvgSamplesPerSec=85.0880231955557, CurrSamplesPerSec=84.99230800892872, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:41,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=4270, skipped=80, lr=[7.845973001767257e-06, 7.845973001767257e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:41,943] [INFO] [timer.py:215:stop] epoch=4/micro_step=590/global_step=4270, RunningAvgSamplesPerSec=85.0880127423596, CurrSamplesPerSec=84.7972797694223, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:49,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=4280, skipped=80, lr=[7.837936655745642e-06, 7.837936655745642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:49,472] [INFO] [timer.py:215:stop] epoch=4/micro_step=600/global_step=4280, RunningAvgSamplesPerSec=85.08798343691608, CurrSamplesPerSec=85.27331697979187, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:55:51,676] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:55:52,373] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:55:56,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=4290, skipped=82, lr=[7.831497696042727e-06, 7.831497696042727e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:55:56,895] [INFO] [timer.py:215:stop] epoch=4/micro_step=610/global_step=4290, RunningAvgSamplesPerSec=85.09074967784777, CurrSamplesPerSec=85.15915616012121, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:04,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=4300, skipped=82, lr=[7.823436673602674e-06, 7.823436673602674e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:04,436] [INFO] [timer.py:215:stop] epoch=4/micro_step=620/global_step=4300, RunningAvgSamplesPerSec=85.09041214974943, CurrSamplesPerSec=85.02442418923101, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:11,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=4310, skipped=82, lr=[7.8153619934226e-06, 7.8153619934226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:11,969] [INFO] [timer.py:215:stop] epoch=4/micro_step=630/global_step=4310, RunningAvgSamplesPerSec=85.09026784029392, CurrSamplesPerSec=85.0786702603748, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:19,469] [INFO] [logging.py:96:log_dist] [Rank 0] step=4320, skipped=82, lr=[7.807273692282295e-06, 7.807273692282295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:19,502] [INFO] [timer.py:215:stop] epoch=4/micro_step=640/global_step=4320, RunningAvgSamplesPerSec=85.09012295043603, CurrSamplesPerSec=85.15310498143312, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=4330, skipped=82, lr=[7.799171807023597e-06, 
7.799171807023597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:27,031] [INFO] [timer.py:215:stop] epoch=4/micro_step=650/global_step=4330, RunningAvgSamplesPerSec=85.09009349364526, CurrSamplesPerSec=85.16655921784023, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:34,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=4340, skipped=82, lr=[7.791056374550221e-06, 7.791056374550221e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:34,557] [INFO] [timer.py:215:stop] epoch=4/micro_step=660/global_step=4340, RunningAvgSamplesPerSec=85.09014418806535, CurrSamplesPerSec=85.23297407278518, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:42,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=4350, skipped=82, lr=[7.782927431827583e-06, 7.782927431827583e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:42,093] [INFO] [timer.py:215:stop] epoch=4/micro_step=670/global_step=4350, RunningAvgSamplesPerSec=85.08991833852015, CurrSamplesPerSec=85.15553615523416, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:49,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=4360, skipped=82, lr=[7.77478501588264e-06, 7.77478501588264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:49,619] [INFO] [timer.py:215:stop] epoch=4/micro_step=680/global_step=4360, RunningAvgSamplesPerSec=85.08996974664186, CurrSamplesPerSec=85.1976985694771, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:56:57,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=4370, skipped=82, lr=[7.766629163803721e-06, 7.766629163803721e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:56:57,142] [INFO] [timer.py:215:stop] epoch=4/micro_step=690/global_step=4370, RunningAvgSamplesPerSec=85.09007764273932, CurrSamplesPerSec=85.33317057322711, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:04,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=4380, skipped=82, lr=[7.75845991274035e-06, 7.75845991274035e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-06-29 17:57:04,676] [INFO] [timer.py:215:stop] epoch=4/micro_step=700/global_step=4380, RunningAvgSamplesPerSec=85.08991826211881, CurrSamplesPerSec=84.97901637434454, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:08,398] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:57:09,094] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:57:12,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=4390, skipped=84, lr=[7.7519148896243e-06, 7.7519148896243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:12,105] [INFO] [timer.py:215:stop] epoch=4/micro_step=710/global_step=4390, RunningAvgSamplesPerSec=85.09247572025112, CurrSamplesPerSec=84.84265441520189, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:19,603] [INFO] [logging.py:96:log_dist] [Rank 0] step=4400, skipped=84, lr=[7.743721614200437e-06, 7.743721614200437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:19,637] [INFO] [timer.py:215:stop] epoch=4/micro_step=720/global_step=4400, RunningAvgSamplesPerSec=85.09234385707875, CurrSamplesPerSec=84.85271148262109, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:27,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=4410, skipped=84, lr=[7.735515044134952e-06, 7.735515044134952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:27,157] [INFO] [timer.py:215:stop] epoch=4/micro_step=730/global_step=4410, RunningAvgSamplesPerSec=85.09254128936476, CurrSamplesPerSec=85.1686128487359, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:34,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=4420, skipped=84, lr=[7.727295216808389e-06, 7.727295216808389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:34,675] [INFO] [timer.py:215:stop] epoch=4/micro_step=740/global_step=4420, 
RunningAvgSamplesPerSec=85.09278191644998, CurrSamplesPerSec=85.31833488490062, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:42,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=4430, skipped=84, lr=[7.719062169661682e-06, 7.719062169661682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:42,200] [INFO] [timer.py:215:stop] epoch=4/micro_step=750/global_step=4430, RunningAvgSamplesPerSec=85.09283353478702, CurrSamplesPerSec=85.04230993067306, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:49,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=4440, skipped=84, lr=[7.710815940195977e-06, 7.710815940195977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:49,724] [INFO] [timer.py:215:stop] epoch=4/micro_step=760/global_step=4440, RunningAvgSamplesPerSec=85.09291585749963, CurrSamplesPerSec=85.1872081200232, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:57:57,228] [INFO] [logging.py:96:log_dist] [Rank 0] step=4450, skipped=84, lr=[7.702556565972468e-06, 7.702556565972468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:57:57,262] [INFO] [timer.py:215:stop] epoch=4/micro_step=770/global_step=4450, RunningAvgSamplesPerSec=85.09263913072797, CurrSamplesPerSec=84.86212705824333, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:04,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=4460, skipped=84, lr=[7.694284084612225e-06, 7.694284084612225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:04,781] [INFO] [timer.py:215:stop] epoch=4/micro_step=780/global_step=4460, RunningAvgSamplesPerSec=85.09283250171512, CurrSamplesPerSec=85.34497229039275, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:12,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=4470, skipped=84, lr=[7.685998533796011e-06, 7.685998533796011e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:12,307] [INFO] [timer.py:215:stop] epoch=4/micro_step=790/global_step=4470, RunningAvgSamplesPerSec=85.09288294106285, 
CurrSamplesPerSec=84.86754665596794, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:19,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=4480, skipped=84, lr=[7.677699951264129e-06, 7.677699951264129e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:19,841] [INFO] [timer.py:215:stop] epoch=4/micro_step=800/global_step=4480, RunningAvgSamplesPerSec=85.09270233382654, CurrSamplesPerSec=84.87312795526728, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:25,046] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:58:25,742] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:58:27,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=4490, skipped=86, lr=[7.671051727802724e-06, 7.671051727802724e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:27,251] [INFO] [timer.py:215:stop] epoch=4/micro_step=810/global_step=4490, RunningAvgSamplesPerSec=85.09567019266068, CurrSamplesPerSec=84.95010663248014, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:34,741] [INFO] [logging.py:96:log_dist] [Rank 0] step=4500, skipped=86, lr=[7.66272978347756e-06, 7.66272978347756e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:34,775] [INFO] [timer.py:215:stop] epoch=4/micro_step=820/global_step=4500, RunningAvgSamplesPerSec=85.09574667525482, CurrSamplesPerSec=85.03533251899421, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:42,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=4510, skipped=86, lr=[7.654394913424805e-06, 7.654394913424805e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:42,301] [INFO] [timer.py:215:stop] epoch=4/micro_step=830/global_step=4510, RunningAvgSamplesPerSec=85.09577317530383, CurrSamplesPerSec=84.97982344006327, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
17:58:49,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=4520, skipped=86, lr=[7.646047155609408e-06, 7.646047155609408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:49,826] [INFO] [timer.py:215:stop] epoch=4/micro_step=840/global_step=4520, RunningAvgSamplesPerSec=85.0958208737049, CurrSamplesPerSec=85.40006299172518, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:58:57,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=4530, skipped=86, lr=[7.637686548055018e-06, 7.637686548055018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:58:57,351] [INFO] [timer.py:215:stop] epoch=4/micro_step=850/global_step=4530, RunningAvgSamplesPerSec=85.09587286541577, CurrSamplesPerSec=85.22783241925539, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:04,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=4540, skipped=86, lr=[7.6293131288438135e-06, 7.6293131288438135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:04,882] [INFO] [timer.py:215:stop] epoch=4/micro_step=860/global_step=4540, RunningAvgSamplesPerSec=85.09576315871325, CurrSamplesPerSec=85.37385278136017, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:12,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=4550, skipped=86, lr=[7.620926936116333e-06, 7.620926936116333e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:12,407] [INFO] [timer.py:215:stop] epoch=4/micro_step=870/global_step=4550, RunningAvgSamplesPerSec=85.09581795499872, CurrSamplesPerSec=84.64620290206051, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:19,911] [INFO] [logging.py:96:log_dist] [Rank 0] step=4560, skipped=86, lr=[7.612528008071294e-06, 7.612528008071294e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:19,944] [INFO] [timer.py:215:stop] epoch=4/micro_step=880/global_step=4560, RunningAvgSamplesPerSec=85.09557433809377, CurrSamplesPerSec=85.0614161171994, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:27,448] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=4570, skipped=86, lr=[7.604116382965426e-06, 7.604116382965426e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:27,482] [INFO] [timer.py:215:stop] epoch=4/micro_step=890/global_step=4570, RunningAvgSamplesPerSec=85.09529604414186, CurrSamplesPerSec=84.94935389739132, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:34,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=4580, skipped=86, lr=[7.595692099113291e-06, 7.595692099113291e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:35,013] [INFO] [timer.py:215:stop] epoch=4/micro_step=900/global_step=4580, RunningAvgSamplesPerSec=85.09519582869945, CurrSamplesPerSec=84.98240615338248, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:41,727] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 17:59:42,425] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 17:59:42,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=4590, skipped=88, lr=[7.5889435835184686e-06, 7.5889435835184686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:42,426] [INFO] [timer.py:215:stop] epoch=4/micro_step=910/global_step=4590, RunningAvgSamplesPerSec=85.09798908670577, CurrSamplesPerSec=91.69377640672774, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 17:59:49,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=4600, skipped=88, lr=[7.580496610659687e-06, 7.580496610659687e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 17:59:49,946] [INFO] [timer.py:215:stop] epoch=4/micro_step=920/global_step=4600, RunningAvgSamplesPerSec=85.09816834855944, CurrSamplesPerSec=84.92183446162588, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 5/16 *****
ppl: 1.8492867946624756
Beginning of Epoch 6/16, Total Micro Batches 920
[2023-06-29 18:00:15,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=4610, skipped=88, lr=[7.572037086641604e-06, 7.572037086641604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:15,407] [INFO] [timer.py:215:stop] epoch=5/micro_step=10/global_step=4610, RunningAvgSamplesPerSec=85.09704304982604, CurrSamplesPerSec=84.70218555784989, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:00:22,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=4620, skipped=88, lr=[7.5635650499969625e-06, 7.5635650499969625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:22,961] [INFO] [timer.py:215:stop] epoch=5/micro_step=20/global_step=4620, RunningAvgSamplesPerSec=85.09638530277826, CurrSamplesPerSec=84.99055887209249, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:00:30,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=4630, skipped=88, lr=[7.555080539315493e-06, 7.555080539315493e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:30,522] [INFO] [timer.py:215:stop] epoch=5/micro_step=30/global_step=4630,
RunningAvgSamplesPerSec=85.09553491520474, CurrSamplesPerSec=84.8167851859435, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:00:38,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=4640, skipped=88, lr=[7.5465835932437515e-06, 7.5465835932437515e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:38,064] [INFO] [timer.py:215:stop] epoch=5/micro_step=40/global_step=4640, RunningAvgSamplesPerSec=85.09516937709482, CurrSamplesPerSec=84.9765145680335, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:00:45,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=4650, skipped=88, lr=[7.538074250484931e-06, 7.538074250484931e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:45,629] [INFO] [timer.py:215:stop] epoch=5/micro_step=50/global_step=4650, RunningAvgSamplesPerSec=85.09424995849935, CurrSamplesPerSec=85.04605503691918, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:00:53,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=4660, skipped=88, lr=[7.529552549798694e-06, 7.529552549798694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:00:53,174] [INFO] [timer.py:215:stop] epoch=5/micro_step=60/global_step=4660, RunningAvgSamplesPerSec=85.09382279360861, CurrSamplesPerSec=85.17588241696211, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:00,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=4670, skipped=88, lr=[7.521018530000993e-06, 7.521018530000993e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:00,701] [INFO] [timer.py:215:stop] epoch=5/micro_step=70/global_step=4670, RunningAvgSamplesPerSec=85.09381111892186, CurrSamplesPerSec=85.03600596440553, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:08,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=4680, skipped=88, lr=[7.51247222996389e-06, 7.51247222996389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:08,228] [INFO] [timer.py:215:stop] epoch=5/micro_step=80/global_step=4680, RunningAvgSamplesPerSec=85.09381556477716, 
CurrSamplesPerSec=84.59189361232787, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:15,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=4690, skipped=88, lr=[7.503913688615389e-06, 7.503913688615389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:15,765] [INFO] [timer.py:215:stop] epoch=5/micro_step=90/global_step=4690, RunningAvgSamplesPerSec=85.09357905582661, CurrSamplesPerSec=84.80041396343134, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:16,459] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:01:17,155] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:01:23,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=4700, skipped=90, lr=[7.497058067987595e-06, 7.497058067987595e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:23,188] [INFO] [timer.py:215:stop] epoch=5/micro_step=100/global_step=4700, RunningAvgSamplesPerSec=85.0960718203359, CurrSamplesPerSec=84.44287520840543, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:30,686] [INFO] [logging.py:96:log_dist] [Rank 0] step=4710, skipped=90, lr=[7.488477590555002e-06, 7.488477590555002e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:30,720] [INFO] [timer.py:215:stop] epoch=5/micro_step=110/global_step=4710, RunningAvgSamplesPerSec=85.09593679651144, CurrSamplesPerSec=84.8293827360246, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:38,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=4720, skipped=90, lr=[7.479884981105479e-06, 7.479884981105479e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:38,254] [INFO] [timer.py:215:stop] epoch=5/micro_step=120/global_step=4720, RunningAvgSamplesPerSec=85.09577939480202, CurrSamplesPerSec=84.84536288410845, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:01:45,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=4730, skipped=90, lr=[7.471280278777963e-06, 7.471280278777963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:45,780] [INFO] [timer.py:215:stop] epoch=5/micro_step=130/global_step=4730, RunningAvgSamplesPerSec=85.0957896211744, CurrSamplesPerSec=85.11152639599027, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:01:53,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=4740, skipped=90, lr=[7.462663522766476e-06, 7.462663522766476e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:01:53,308] [INFO] [timer.py:215:stop] epoch=5/micro_step=140/global_step=4740, RunningAvgSamplesPerSec=85.09578532005386, CurrSamplesPerSec=85.28886866740719, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:00,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=4750, skipped=90, lr=[7.45403475231994e-06, 7.45403475231994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:00,829] [INFO] [timer.py:215:stop] epoch=5/micro_step=150/global_step=4750, RunningAvgSamplesPerSec=85.09590635479059, CurrSamplesPerSec=85.22169030959897, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:08,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=4760, skipped=90, lr=[7.445394006742005e-06, 7.445394006742005e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:08,364] [INFO] [timer.py:215:stop] epoch=5/micro_step=160/global_step=4760, RunningAvgSamplesPerSec=85.09571641363428, CurrSamplesPerSec=84.71440153425162, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:15,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=4770, skipped=90, lr=[7.436741325390867e-06, 7.436741325390867e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:15,895] [INFO] [timer.py:215:stop] epoch=5/micro_step=170/global_step=4770, RunningAvgSamplesPerSec=85.09563632737238, CurrSamplesPerSec=85.03832269808318, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:23,384] [INFO] [logging.py:96:log_dist] [Rank 
0] step=4780, skipped=90, lr=[7.428076747679087e-06, 7.428076747679087e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:23,417] [INFO] [timer.py:215:stop] epoch=5/micro_step=180/global_step=4780, RunningAvgSamplesPerSec=85.0957465477775, CurrSamplesPerSec=85.17823380684992, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:30,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=4790, skipped=90, lr=[7.419400313073417e-06, 7.419400313073417e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:30,942] [INFO] [timer.py:215:stop] epoch=5/micro_step=190/global_step=4790, RunningAvgSamplesPerSec=85.09577233222356, CurrSamplesPerSec=85.48241059128564, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:33,144] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:02:33,840] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:02:38,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=4800, skipped=92, lr=[7.412450654981417e-06, 7.412450654981417e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:38,369] [INFO] [timer.py:215:stop] epoch=5/micro_step=200/global_step=4800, RunningAvgSamplesPerSec=85.09814473767099, CurrSamplesPerSec=84.9640077457696, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:45,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=4810, skipped=92, lr=[7.403752977595229e-06, 7.403752977595229e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:45,906] [INFO] [timer.py:215:stop] epoch=5/micro_step=210/global_step=4810, RunningAvgSamplesPerSec=85.09788565619954, CurrSamplesPerSec=85.20375608789936, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:02:53,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=4820, skipped=92, lr=[7.395043554108795e-06, 7.395043554108795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:02:53,439] [INFO] [timer.py:215:stop] epoch=5/micro_step=220/global_step=4820, RunningAvgSamplesPerSec=85.09775609827663, CurrSamplesPerSec=85.00245442250456, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:00,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=4830, skipped=92, lr=[7.386322424193133e-06, 7.386322424193133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:00,969] [INFO] [timer.py:215:stop] epoch=5/micro_step=230/global_step=4830, RunningAvgSamplesPerSec=85.09768130543837, CurrSamplesPerSec=85.01421868190893, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:08,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=4840, skipped=92, lr=[7.377589627572588e-06, 7.377589627572588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:08,508] [INFO] [timer.py:215:stop] epoch=5/micro_step=240/global_step=4840, RunningAvgSamplesPerSec=85.0973683392475, CurrSamplesPerSec=84.73453751596455, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:03:16,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=4850, skipped=92, lr=[7.368845204024645e-06, 7.368845204024645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:16,039] [INFO] [timer.py:215:stop] epoch=5/micro_step=250/global_step=4850, RunningAvgSamplesPerSec=85.09727635927402, CurrSamplesPerSec=85.2573106268856, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:23,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=4860, skipped=92, lr=[7.360089193379744e-06, 7.360089193379744e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:23,566] [INFO] [timer.py:215:stop] epoch=5/micro_step=260/global_step=4860, RunningAvgSamplesPerSec=85.09726875951587, CurrSamplesPerSec=85.12828793916225, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:31,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=4870, skipped=92, lr=[7.351321635521108e-06, 7.351321635521108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:31,092] [INFO] [timer.py:215:stop] epoch=5/micro_step=270/global_step=4870, RunningAvgSamplesPerSec=85.09728148458542, CurrSamplesPerSec=84.94585923064533, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:38,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=4880, skipped=92, lr=[7.342542570384559e-06, 7.342542570384559e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:38,623] [INFO] [timer.py:215:stop] epoch=5/micro_step=280/global_step=4880, RunningAvgSamplesPerSec=85.09719610512285, CurrSamplesPerSec=84.94370880940409, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:46,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=4890, skipped=92, lr=[7.333752037958332e-06, 7.333752037958332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:46,152] [INFO] [timer.py:215:stop] epoch=5/micro_step=290/global_step=4890, RunningAvgSamplesPerSec=85.09714582818391, CurrSamplesPerSec=85.10977234874798, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:03:49,868] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:03:50,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:03:53,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=4900, skipped=94, lr=[7.326711382474223e-06, 7.326711382474223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:03:53,578] [INFO] [timer.py:215:stop] epoch=5/micro_step=300/global_step=4900, RunningAvgSamplesPerSec=85.09947099394115, CurrSamplesPerSec=85.16455972502206, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:01,077] [INFO] [logging.py:96:log_dist] [Rank 0] step=4910, skipped=94, lr=[7.317900309863533e-06, 7.317900309863533e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:01,111] [INFO] [timer.py:215:stop] epoch=5/micro_step=310/global_step=4910, RunningAvgSamplesPerSec=85.09931825997212, CurrSamplesPerSec=85.02383171812558, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:08,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=4920, skipped=94, lr=[7.309077882207519e-06, 7.309077882207519e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:08,666] [INFO] [timer.py:215:stop] epoch=5/micro_step=320/global_step=4920, RunningAvgSamplesPerSec=85.09866960174806, CurrSamplesPerSec=84.9883792792519, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:16,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=4930, skipped=94, lr=[7.300244139691927e-06, 7.300244139691927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:16,220] [INFO] [timer.py:215:stop] epoch=5/micro_step=330/global_step=4930, RunningAvgSamplesPerSec=85.09803260206233, CurrSamplesPerSec=84.83235844837967, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:23,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=4940, skipped=94, lr=[7.291399122554046e-06, 
7.291399122554046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:23,764] [INFO] [timer.py:215:stop] epoch=5/micro_step=340/global_step=4940, RunningAvgSamplesPerSec=85.0976382315645, CurrSamplesPerSec=84.98184117142948, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:31,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=4950, skipped=94, lr=[7.28254287108252e-06, 7.28254287108252e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:31,316] [INFO] [timer.py:215:stop] epoch=5/micro_step=350/global_step=4950, RunningAvgSamplesPerSec=85.09705597248376, CurrSamplesPerSec=84.68452272977501, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:38,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=4960, skipped=94, lr=[7.273675425617163e-06, 7.273675425617163e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:38,843] [INFO] [timer.py:215:stop] epoch=5/micro_step=360/global_step=4960, RunningAvgSamplesPerSec=85.09703668466733, CurrSamplesPerSec=85.07025799123804, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:46,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=4970, skipped=94, lr=[7.264796826548777e-06, 7.264796826548777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:46,370] [INFO] [timer.py:215:stop] epoch=5/micro_step=370/global_step=4970, RunningAvgSamplesPerSec=85.09704564583603, CurrSamplesPerSec=85.30044128568028, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:04:53,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=4980, skipped=94, lr=[7.25590711431897e-06, 7.25590711431897e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:04:53,899] [INFO] [timer.py:215:stop] epoch=5/micro_step=380/global_step=4980, RunningAvgSamplesPerSec=85.09697790038874, CurrSamplesPerSec=85.4111494554963, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:01,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=4990, skipped=94, lr=[7.247006329419968e-06, 7.247006329419968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-06-29 18:05:01,444] [INFO] [timer.py:215:stop] epoch=5/micro_step=390/global_step=4990, RunningAvgSamplesPerSec=85.09657666519863, CurrSamplesPerSec=84.64617621043968, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:06,657] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:05:07,356] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:05:08,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=5000, skipped=96, lr=[7.239877756421927e-06, 7.239877756421927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:08,863] [INFO] [timer.py:215:stop] epoch=5/micro_step=400/global_step=5000, RunningAvgSamplesPerSec=85.09902304985064, CurrSamplesPerSec=84.85737877176634, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:16,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=5010, skipped=96, lr=[7.23095714291966e-06, 7.23095714291966e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:16,392] [INFO] [timer.py:215:stop] epoch=5/micro_step=410/global_step=5010, RunningAvgSamplesPerSec=85.09895811514123, CurrSamplesPerSec=85.20386426569937, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:23,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=5020, skipped=96, lr=[7.2220255703941615e-06, 7.2220255703941615e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:23,917] [INFO] [timer.py:215:stop] epoch=5/micro_step=420/global_step=5020, RunningAvgSamplesPerSec=85.0989905878015, CurrSamplesPerSec=84.8839706537574, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:31,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=5030, skipped=96, lr=[7.2130830795283315e-06, 7.2130830795283315e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:31,437] [INFO] [timer.py:215:stop] epoch=5/micro_step=430/global_step=5030, 
RunningAvgSamplesPerSec=85.09913728200947, CurrSamplesPerSec=85.02579767673306, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:38,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=5040, skipped=96, lr=[7.2041297110548e-06, 7.2041297110548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:38,974] [INFO] [timer.py:215:stop] epoch=5/micro_step=440/global_step=5040, RunningAvgSamplesPerSec=85.09890233928638, CurrSamplesPerSec=85.10731680406533, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:46,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=5050, skipped=96, lr=[7.1951655057557455e-06, 7.1951655057557455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:46,493] [INFO] [timer.py:215:stop] epoch=5/micro_step=450/global_step=5050, RunningAvgSamplesPerSec=85.09907412978157, CurrSamplesPerSec=85.15980455170427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:05:53,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=5060, skipped=96, lr=[7.186190504462706e-06, 7.186190504462706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:05:54,021] [INFO] [timer.py:215:stop] epoch=5/micro_step=460/global_step=5060, RunningAvgSamplesPerSec=85.09902125674654, CurrSamplesPerSec=85.29808314384238, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:01,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=5070, skipped=96, lr=[7.1772047480564e-06, 7.1772047480564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:01,549] [INFO] [timer.py:215:stop] epoch=5/micro_step=470/global_step=5070, RunningAvgSamplesPerSec=85.09899429280865, CurrSamplesPerSec=84.95260688404994, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:09,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=5080, skipped=96, lr=[7.168208277466528e-06, 7.168208277466528e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:09,086] [INFO] [timer.py:215:stop] epoch=5/micro_step=480/global_step=5080, RunningAvgSamplesPerSec=85.09877790462515, 
CurrSamplesPerSec=84.64300002774814, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:16,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=5090, skipped=96, lr=[7.159201133671599e-06, 7.159201133671599e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:16,626] [INFO] [timer.py:215:stop] epoch=5/micro_step=490/global_step=5090, RunningAvgSamplesPerSec=85.09849104276012, CurrSamplesPerSec=85.04745616864292, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:23,341] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:06:24,038] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:06:24,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=5100, skipped=98, lr=[7.151987761496608e-06, 7.151987761496608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:24,040] [INFO] [timer.py:215:stop] epoch=5/micro_step=500/global_step=5100, RunningAvgSamplesPerSec=85.10099593610214, CurrSamplesPerSec=91.77547244962675, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:31,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=5110, skipped=98, lr=[7.142961509353471e-06, 7.142961509353471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:31,567] [INFO] [timer.py:215:stop] epoch=5/micro_step=510/global_step=5110, RunningAvgSamplesPerSec=85.10097456448294, CurrSamplesPerSec=84.98972469372596, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:39,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=5120, skipped=98, lr=[7.133924699003135e-06, 7.133924699003135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:39,101] [INFO] [timer.py:215:stop] epoch=5/micro_step=520/global_step=5120, RunningAvgSamplesPerSec=85.10081059475256, CurrSamplesPerSec=85.10769457094774, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:06:46,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=5130, skipped=98, lr=[7.124877371607849e-06, 7.124877371607849e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:46,634] [INFO] [timer.py:215:stop] epoch=5/micro_step=530/global_step=5130, RunningAvgSamplesPerSec=85.10066739945144, CurrSamplesPerSec=85.09239769418649, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:06:54,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=5140, skipped=98, lr=[7.115819568377772e-06, 7.115819568377772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:06:54,162] [INFO] [timer.py:215:stop] epoch=5/micro_step=540/global_step=5140, RunningAvgSamplesPerSec=85.10063407983596, CurrSamplesPerSec=85.45786375919091, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:01,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=5150, skipped=98, lr=[7.106751330570777e-06, 7.106751330570777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:01,685] [INFO] [timer.py:215:stop] epoch=5/micro_step=550/global_step=5150, RunningAvgSamplesPerSec=85.10071381580622, CurrSamplesPerSec=85.47291130873138, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:09,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=5160, skipped=98, lr=[7.097672699492267e-06, 7.097672699492267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:09,214] [INFO] [timer.py:215:stop] epoch=5/micro_step=560/global_step=5160, RunningAvgSamplesPerSec=85.10065345173844, CurrSamplesPerSec=84.87269859848912, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:16,719] [INFO] [logging.py:96:log_dist] [Rank 0] step=5170, skipped=98, lr=[7.088583716494987e-06, 7.088583716494987e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:16,753] [INFO] [timer.py:215:stop] epoch=5/micro_step=570/global_step=5170, RunningAvgSamplesPerSec=85.10037961487122, CurrSamplesPerSec=84.92033000667188, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:24,248] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=5180, skipped=98, lr=[7.07948442297883e-06, 7.07948442297883e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:24,282] [INFO] [timer.py:215:stop] epoch=5/micro_step=580/global_step=5180, RunningAvgSamplesPerSec=85.1003305660305, CurrSamplesPerSec=84.79757442736818, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:31,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=5190, skipped=98, lr=[7.07037486039066e-06, 7.07037486039066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:31,813] [INFO] [timer.py:215:stop] epoch=5/micro_step=590/global_step=5190, RunningAvgSamplesPerSec=85.100228996882, CurrSamplesPerSec=84.92669743956417, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:39,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=5200, skipped=98, lr=[7.0612550702241075e-06, 7.0612550702241075e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:39,347] [INFO] [timer.py:215:stop] epoch=5/micro_step=600/global_step=5200, RunningAvgSamplesPerSec=85.10006980273566, CurrSamplesPerSec=85.3549316805578, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:40,044] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:07:40,741] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:07:46,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=5210, skipped=100, lr=[7.053951902147903e-06, 7.053951902147903e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:46,766] [INFO] [timer.py:215:stop] epoch=5/micro_step=610/global_step=5210, RunningAvgSamplesPerSec=85.1024286594513, CurrSamplesPerSec=85.08538509205536, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:07:54,264] [INFO] [logging.py:96:log_dist] [Rank 0] step=5220, skipped=100, lr=[7.04481380705281e-06, 7.04481380705281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:07:54,297] [INFO] [timer.py:215:stop] epoch=5/micro_step=620/global_step=5220, RunningAvgSamplesPerSec=85.10231176115968, CurrSamplesPerSec=84.87594571692362, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:01,785] [INFO] [logging.py:96:log_dist] [Rank 0] step=5230, skipped=100, lr=[7.03566560080875e-06, 7.03566560080875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:01,818] [INFO] [timer.py:215:stop] epoch=5/micro_step=630/global_step=5230, RunningAvgSamplesPerSec=85.10242380481621, CurrSamplesPerSec=85.03554802036528, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:09,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=5240, skipped=100, lr=[7.026507325085379e-06, 7.026507325085379e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:09,350] [INFO] [timer.py:215:stop] epoch=5/micro_step=640/global_step=5240, RunningAvgSamplesPerSec=85.10230712006333, CurrSamplesPerSec=85.1226730316317, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:16,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=5250, skipped=100, lr=[7.017339021598217e-06, 7.017339021598217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:16,890] [INFO] [timer.py:215:stop] epoch=5/micro_step=650/global_step=5250, RunningAvgSamplesPerSec=85.10201004080771, CurrSamplesPerSec=84.78755720587989, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:08:24,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=5260, skipped=100, lr=[7.008160732108462e-06, 7.008160732108462e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:24,423] [INFO] [timer.py:215:stop] epoch=5/micro_step=660/global_step=5260, RunningAvgSamplesPerSec=85.10185894158454, CurrSamplesPerSec=85.09242466796465, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:31,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=5270, skipped=100, lr=[6.998972498422798e-06, 6.998972498422798e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:31,961] [INFO] [timer.py:215:stop] epoch=5/micro_step=670/global_step=5270, RunningAvgSamplesPerSec=85.10162617676292, CurrSamplesPerSec=84.90507354673828, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:39,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=5280, skipped=100, lr=[6.989774362393201e-06, 6.989774362393201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:39,495] [INFO] [timer.py:215:stop] epoch=5/micro_step=680/global_step=5280, RunningAvgSamplesPerSec=85.10145365215449, CurrSamplesPerSec=85.00541537109115, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:46,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=5290, skipped=100, lr=[6.980566365916755e-06, 6.980566365916755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:47,020] [INFO] [timer.py:215:stop] epoch=5/micro_step=690/global_step=5290, RunningAvgSamplesPerSec=85.10149291244893, CurrSamplesPerSec=85.15496886893038, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:54,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=5300, skipped=100, lr=[6.971348550935457e-06, 6.971348550935457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:08:54,547] [INFO] [timer.py:215:stop] epoch=5/micro_step=700/global_step=5300, RunningAvgSamplesPerSec=85.10147110417562, CurrSamplesPerSec=85.03002614232919, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:08:56,747] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:08:57,443] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:09:01,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=5310, skipped=102, lr=[6.963967257840505e-06, 6.963967257840505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:01,967] [INFO] [timer.py:215:stop] epoch=5/micro_step=710/global_step=5310, RunningAvgSamplesPerSec=85.10373416363004, CurrSamplesPerSec=84.43652699677492, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:09,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=5320, skipped=102, lr=[6.954731875386939e-06, 6.954731875386939e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:09,506] [INFO] [timer.py:215:stop] epoch=5/micro_step=720/global_step=5320, RunningAvgSamplesPerSec=85.1034789147199, CurrSamplesPerSec=84.92237177988522, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:16,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=5330, skipped=102, lr=[6.94548679210343e-06, 6.94548679210343e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:17,025] [INFO] [timer.py:215:stop] epoch=5/micro_step=730/global_step=5330, RunningAvgSamplesPerSec=85.10361996348345, CurrSamplesPerSec=85.04349539625066, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:24,525] [INFO] [logging.py:96:log_dist] [Rank 0] step=5340, skipped=102, lr=[6.9362320501009e-06, 6.9362320501009e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:24,559] [INFO] [timer.py:215:stop] epoch=5/micro_step=740/global_step=5340, RunningAvgSamplesPerSec=85.1034613543412, CurrSamplesPerSec=85.29653822505588, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:32,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=5350, skipped=102, lr=[6.9269676915342725e-06, 
6.9269676915342725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:32,084] [INFO] [timer.py:215:stop] epoch=5/micro_step=750/global_step=5350, RunningAvgSamplesPerSec=85.10346655019377, CurrSamplesPerSec=85.00614218091123, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:39,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=5360, skipped=102, lr=[6.917693758602269e-06, 6.917693758602269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:39,623] [INFO] [timer.py:215:stop] epoch=5/micro_step=760/global_step=5360, RunningAvgSamplesPerSec=85.10320665807681, CurrSamplesPerSec=85.09390825209957, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:47,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=5370, skipped=102, lr=[6.908410293547225e-06, 6.908410293547225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:47,154] [INFO] [timer.py:215:stop] epoch=5/micro_step=770/global_step=5370, RunningAvgSamplesPerSec=85.1031109257076, CurrSamplesPerSec=84.7106588476391, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:09:54,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=5380, skipped=102, lr=[6.899117338654896e-06, 6.899117338654896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:09:54,687] [INFO] [timer.py:215:stop] epoch=5/micro_step=780/global_step=5380, RunningAvgSamplesPerSec=85.10295452993068, CurrSamplesPerSec=85.16615390765516, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:02,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=5390, skipped=102, lr=[6.889814936254255e-06, 6.889814936254255e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:02,219] [INFO] [timer.py:215:stop] epoch=5/micro_step=790/global_step=5390, RunningAvgSamplesPerSec=85.10282508972492, CurrSamplesPerSec=84.67851210105569, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:09,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=5400, skipped=102, lr=[6.880503128717318e-06, 6.880503128717318e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 18:10:09,745] [INFO] [timer.py:215:stop] epoch=5/micro_step=800/global_step=5400, RunningAvgSamplesPerSec=85.10282888697003, CurrSamplesPerSec=84.89103063200045, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:13,452] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:10:14,152] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:10:17,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=5410, skipped=104, lr=[6.87304693949098e-06, 6.87304693949098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:17,166] [INFO] [timer.py:215:stop] epoch=5/micro_step=810/global_step=5410, RunningAvgSamplesPerSec=85.10504056913007, CurrSamplesPerSec=84.77430275713792, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:24,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=5420, skipped=104, lr=[6.863718309622797e-06, 6.863718309622797e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:24,702] [INFO] [timer.py:215:stop] epoch=5/micro_step=820/global_step=5420, RunningAvgSamplesPerSec=85.10482094850211, CurrSamplesPerSec=84.87283277201564, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:32,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=5430, skipped=104, lr=[6.854380393487243e-06, 6.854380393487243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:32,238] [INFO] [timer.py:215:stop] epoch=5/micro_step=830/global_step=5430, RunningAvgSamplesPerSec=85.10461906379585, CurrSamplesPerSec=85.2349226509513, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:39,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=5440, skipped=104, lr=[6.845033233618091e-06, 6.845033233618091e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:39,773] [INFO] [timer.py:215:stop] 
epoch=5/micro_step=840/global_step=5440, RunningAvgSamplesPerSec=85.1044249680657, CurrSamplesPerSec=85.3200704210803, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:47,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=5450, skipped=104, lr=[6.83567687259122e-06, 6.83567687259122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:47,306] [INFO] [timer.py:215:stop] epoch=5/micro_step=850/global_step=5450, RunningAvgSamplesPerSec=85.10428895566609, CurrSamplesPerSec=85.26516409311479, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:10:54,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=5460, skipped=104, lr=[6.826311353024422e-06, 6.826311353024422e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:10:54,842] [INFO] [timer.py:215:stop] epoch=5/micro_step=860/global_step=5460, RunningAvgSamplesPerSec=85.10408634501715, CurrSamplesPerSec=85.17850408986858, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:02,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=5470, skipped=104, lr=[6.816936717577205e-06, 6.816936717577205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:02,374] [INFO] [timer.py:215:stop] epoch=5/micro_step=870/global_step=5470, RunningAvgSamplesPerSec=85.10397214517911, CurrSamplesPerSec=84.56140854902465, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:09,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=5480, skipped=104, lr=[6.807553008950597e-06, 6.807553008950597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:09,915] [INFO] [timer.py:215:stop] epoch=5/micro_step=880/global_step=5480, RunningAvgSamplesPerSec=85.10366382670705, CurrSamplesPerSec=84.95698940045189, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:17,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=5490, skipped=104, lr=[6.7981602698869555e-06, 6.7981602698869555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:17,441] [INFO] [timer.py:215:stop] epoch=5/micro_step=890/global_step=5490, 
RunningAvgSamplesPerSec=85.10366576302806, CurrSamplesPerSec=84.77076893732213, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:24,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=5500, skipped=104, lr=[6.788758543169772e-06, 6.788758543169772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:24,972] [INFO] [timer.py:215:stop] epoch=5/micro_step=900/global_step=5500, RunningAvgSamplesPerSec=85.103553246857, CurrSamplesPerSec=85.17401761384333, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:30,180] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:11:30,877] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:11:32,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=5510, skipped=106, lr=[6.781230719462121e-06, 6.781230719462121e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:32,382] [INFO] [timer.py:215:stop] epoch=5/micro_step=910/global_step=5510, RunningAvgSamplesPerSec=85.10594064734768, CurrSamplesPerSec=85.18896536053435, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:11:39,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=5520, skipped=106, lr=[6.77181292291351e-06, 6.77181292291351e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:11:39,904] [INFO] [timer.py:215:stop] epoch=5/micro_step=920/global_step=5520, RunningAvgSamplesPerSec=85.10602147230867, CurrSamplesPerSec=84.83675537847397, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 6/16 *****
ppl: 1.8237462043762207
Beginning of Epoch 7/16, Total Micro Batches 920
[2023-06-29 18:12:05,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=5530, skipped=106, lr=[6.762386258722304e-06, 6.762386258722304e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:05,359] [INFO] [timer.py:215:stop]
epoch=6/micro_step=10/global_step=5530, RunningAvgSamplesPerSec=85.10542377838861, CurrSamplesPerSec=84.56060941119694, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:12,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=5540, skipped=106, lr=[6.752950769826516e-06, 6.752950769826516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:12,901] [INFO] [timer.py:215:stop] epoch=6/micro_step=20/global_step=5540, RunningAvgSamplesPerSec=85.10509558879176, CurrSamplesPerSec=84.88813133533151, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:20,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=5550, skipped=106, lr=[6.743506499204363e-06, 6.743506499204363e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:20,455] [INFO] [timer.py:215:stop] epoch=6/micro_step=30/global_step=5550, RunningAvgSamplesPerSec=85.10451942871903, CurrSamplesPerSec=84.67800457591986, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:27,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=5560, skipped=106, lr=[6.73405348987406e-06, 6.73405348987406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:28,004] [INFO] [timer.py:215:stop] epoch=6/micro_step=40/global_step=5560, RunningAvgSamplesPerSec=85.10405290037657, CurrSamplesPerSec=84.74301728005162, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:35,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=5570, skipped=106, lr=[6.724591784893625e-06, 6.724591784893625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:35,558] [INFO] [timer.py:215:stop] epoch=6/micro_step=50/global_step=5570, RunningAvgSamplesPerSec=85.10350442099389, CurrSamplesPerSec=85.11600629090356, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:43,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=5580, skipped=106, lr=[6.715121427360688e-06, 6.715121427360688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:43,106] [INFO] [timer.py:215:stop] epoch=6/micro_step=60/global_step=5580, 
RunningAvgSamplesPerSec=85.10306227722229, CurrSamplesPerSec=84.88096447960517, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:50,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=5590, skipped=106, lr=[6.7056424604122874e-06, 6.7056424604122874e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:50,656] [INFO] [timer.py:215:stop] epoch=6/micro_step=70/global_step=5590, RunningAvgSamplesPerSec=85.10255991413362, CurrSamplesPerSec=85.02738666861785, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:12:58,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=5600, skipped=106, lr=[6.696154927224676e-06, 6.696154927224676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:12:58,201] [INFO] [timer.py:215:stop] epoch=6/micro_step=80/global_step=5600, RunningAvgSamplesPerSec=85.10219134040622, CurrSamplesPerSec=84.80108369304799, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:04,955] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:13:05,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:13:05,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=5610, skipped=108, lr=[6.688558762021714e-06, 6.688558762021714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:05,656] [INFO] [timer.py:215:stop] epoch=6/micro_step=90/global_step=5610, RunningAvgSamplesPerSec=85.10363932881879, CurrSamplesPerSec=91.56472658543133, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:13,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=5620, skipped=108, lr=[6.679055918532112e-06, 6.679055918532112e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:13,210] [INFO] [timer.py:215:stop] epoch=6/micro_step=100/global_step=5620, RunningAvgSamplesPerSec=85.10307050986572, CurrSamplesPerSec=84.80448608287564, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:20,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=5630, skipped=108, lr=[6.669544629903765e-06, 6.669544629903765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:20,760] [INFO] [timer.py:215:stop] epoch=6/micro_step=110/global_step=5630, RunningAvgSamplesPerSec=85.10259182260293, CurrSamplesPerSec=84.94051025923052, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:28,270] [INFO] [logging.py:96:log_dist] [Rank 0] step=5640, skipped=108, lr=[6.660024939460153e-06, 6.660024939460153e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:28,303] [INFO] [timer.py:215:stop] epoch=6/micro_step=120/global_step=5640, RunningAvgSamplesPerSec=85.10226038612669, CurrSamplesPerSec=84.91490366563131, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:35,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=5650, skipped=108, lr=[6.650496890563025e-06, 6.650496890563025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:35,827] [INFO] [timer.py:215:stop] epoch=6/micro_step=130/global_step=5650, RunningAvgSamplesPerSec=85.10230093271932, CurrSamplesPerSec=85.21676643817156, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 18:13:43,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=5660, skipped=108, lr=[6.640960526612202e-06, 6.640960526612202e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:43,361] [INFO] [timer.py:215:stop] epoch=6/micro_step=140/global_step=5660, RunningAvgSamplesPerSec=85.10215650527583, CurrSamplesPerSec=85.22637122040489, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:50,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=5670, skipped=108, lr=[6.631415891045378e-06, 6.631415891045378e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:50,884] [INFO] [timer.py:215:stop] epoch=6/micro_step=150/global_step=5670, RunningAvgSamplesPerSec=85.10222531909979, CurrSamplesPerSec=84.95158525809971, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:13:58,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=5680, skipped=108, lr=[6.621863027337929e-06, 6.621863027337929e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:13:58,430] [INFO] [timer.py:215:stop] epoch=6/micro_step=160/global_step=5680, RunningAvgSamplesPerSec=85.10182074940538, CurrSamplesPerSec=84.87559684104654, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:05,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=5690, skipped=108, lr=[6.612301979002704e-06, 6.612301979002704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:05,961] [INFO] [timer.py:215:stop] epoch=6/micro_step=170/global_step=5690, RunningAvgSamplesPerSec=85.10174047892032, CurrSamplesPerSec=85.06988055662073, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:13,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=5700, skipped=108, lr=[6.602732789589832e-06, 6.602732789589832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:13,495] [INFO] [timer.py:215:stop] epoch=6/micro_step=180/global_step=5700, RunningAvgSamplesPerSec=85.10158474569609, CurrSamplesPerSec=84.90402620923231, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:14:20,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=5710, skipped=108, lr=[6.593155502686531e-06, 6.593155502686531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:21,026] [INFO] [timer.py:215:stop] epoch=6/micro_step=190/global_step=5710, RunningAvgSamplesPerSec=85.10147702336022, CurrSamplesPerSec=85.02595926641538, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:21,723] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:14:22,420] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:14:28,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=5720, skipped=110, lr=[6.58548787228494e-06, 6.58548787228494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:28,455] [INFO] [timer.py:215:stop] epoch=6/micro_step=200/global_step=5720, RunningAvgSamplesPerSec=85.1034067372588, CurrSamplesPerSec=85.06704990374179, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:35,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=5730, skipped=110, lr=[6.57589611985625e-06, 6.57589611985625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:35,988] [INFO] [timer.py:215:stop] epoch=6/micro_step=210/global_step=5730, RunningAvgSamplesPerSec=85.10327736899805, CurrSamplesPerSec=85.00630369589243, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:43,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=5740, skipped=110, lr=[6.566296392176917e-06, 6.566296392176917e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:43,511] [INFO] [timer.py:215:stop] epoch=6/micro_step=220/global_step=5740, RunningAvgSamplesPerSec=85.1033384967961, CurrSamplesPerSec=84.93005617488724, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:51,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=5750, skipped=110, 
lr=[6.556688732973254e-06, 6.556688732973254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:51,038] [INFO] [timer.py:215:stop] epoch=6/micro_step=230/global_step=5750, RunningAvgSamplesPerSec=85.10333889311327, CurrSamplesPerSec=84.81898278692772, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:14:58,540] [INFO] [logging.py:96:log_dist] [Rank 0] step=5760, skipped=110, lr=[6.547073186007704e-06, 6.547073186007704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:14:58,574] [INFO] [timer.py:215:stop] epoch=6/micro_step=240/global_step=5760, RunningAvgSamplesPerSec=85.10314517547758, CurrSamplesPerSec=84.88966149500548, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:06,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=5770, skipped=110, lr=[6.5374497950786375e-06, 6.5374497950786375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:06,115] [INFO] [timer.py:215:stop] epoch=6/micro_step=250/global_step=5770, RunningAvgSamplesPerSec=85.10285430764635, CurrSamplesPerSec=85.22623592674941, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:13,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=5780, skipped=110, lr=[6.527818604020154e-06, 6.527818604020154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:13,641] [INFO] [timer.py:215:stop] epoch=6/micro_step=260/global_step=5780, RunningAvgSamplesPerSec=85.10285170622056, CurrSamplesPerSec=85.26543492810094, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:21,145] [INFO] [logging.py:96:log_dist] [Rank 0] step=5790, skipped=110, lr=[6.518179656701883e-06, 6.518179656701883e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:21,179] [INFO] [timer.py:215:stop] epoch=6/micro_step=270/global_step=5790, RunningAvgSamplesPerSec=85.10263231091157, CurrSamplesPerSec=84.74898355695396, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:28,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=5800, skipped=110, lr=[6.50853299702878e-06, 
6.50853299702878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:28,700] [INFO] [timer.py:215:stop] epoch=6/micro_step=280/global_step=5800, RunningAvgSamplesPerSec=85.1027285156583, CurrSamplesPerSec=84.82420924670315, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:36,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=5810, skipped=110, lr=[6.498878668940935e-06, 6.498878668940935e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:36,228] [INFO] [timer.py:215:stop] epoch=6/micro_step=290/global_step=5810, RunningAvgSamplesPerSec=85.10270652686128, CurrSamplesPerSec=85.18912757100563, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:38,434] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:15:39,128] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:15:43,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=5820, skipped=112, lr=[6.4911497147620875e-06, 6.4911497147620875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:43,647] [INFO] [timer.py:215:stop] epoch=6/micro_step=300/global_step=5820, RunningAvgSamplesPerSec=85.10478533953645, CurrSamplesPerSec=85.08589751229525, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:51,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=5830, skipped=112, lr=[6.481481694368093e-06, 6.481481694368093e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:51,185] [INFO] [timer.py:215:stop] epoch=6/micro_step=310/global_step=5830, RunningAvgSamplesPerSec=85.10454773334519, CurrSamplesPerSec=84.78763754850797, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:15:58,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=5840, skipped=112, lr=[6.471806128776786e-06, 6.471806128776786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:15:58,721] [INFO] 
[timer.py:215:stop] epoch=6/micro_step=320/global_step=5840, RunningAvgSamplesPerSec=85.10437612658873, CurrSamplesPerSec=84.97390531409285, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:06,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=5850, skipped=112, lr=[6.462123062059916e-06, 6.462123062059916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:06,263] [INFO] [timer.py:215:stop] epoch=6/micro_step=330/global_step=5850, RunningAvgSamplesPerSec=85.10406504318105, CurrSamplesPerSec=85.06888306696833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:13,771] [INFO] [logging.py:96:log_dist] [Rank 0] step=5860, skipped=112, lr=[6.452432538323406e-06, 6.452432538323406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:13,804] [INFO] [timer.py:215:stop] epoch=6/micro_step=340/global_step=5860, RunningAvgSamplesPerSec=85.10379233598711, CurrSamplesPerSec=85.12008178566208, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:21,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=5870, skipped=112, lr=[6.442734601707142e-06, 6.442734601707142e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:21,339] [INFO] [timer.py:215:stop] epoch=6/micro_step=350/global_step=5870, RunningAvgSamplesPerSec=85.10361430788817, CurrSamplesPerSec=84.86510506834469, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:28,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=5880, skipped=112, lr=[6.433029296384776e-06, 6.433029296384776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:28,887] [INFO] [timer.py:215:stop] epoch=6/micro_step=360/global_step=5880, RunningAvgSamplesPerSec=85.10321182374324, CurrSamplesPerSec=84.51838315397065, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:36,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=5890, skipped=112, lr=[6.423316666563523e-06, 6.423316666563523e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:36,413] [INFO] [timer.py:215:stop] 
epoch=6/micro_step=370/global_step=5890, RunningAvgSamplesPerSec=85.10320500701815, CurrSamplesPerSec=85.20205232378217, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:43,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=5900, skipped=112, lr=[6.41359675648396e-06, 6.41359675648396e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:43,938] [INFO] [timer.py:215:stop] epoch=6/micro_step=380/global_step=5900, RunningAvgSamplesPerSec=85.10322405040063, CurrSamplesPerSec=84.98245996157932, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:51,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=5910, skipped=112, lr=[6.403869610419829e-06, 6.403869610419829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:51,462] [INFO] [timer.py:215:stop] epoch=6/micro_step=390/global_step=5910, RunningAvgSamplesPerSec=85.10326617386653, CurrSamplesPerSec=85.01351865774924, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:16:55,174] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:16:55,871] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:16:58,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=5920, skipped=114, lr=[6.396082713432634e-06, 6.396082713432634e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:16:58,883] [INFO] [timer.py:215:stop] epoch=6/micro_step=400/global_step=5920, RunningAvgSamplesPerSec=85.10527347766588, CurrSamplesPerSec=84.71509664103563, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:06,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=5930, skipped=114, lr=[6.386342654271181e-06, 6.386342654271181e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:06,405] [INFO] [timer.py:215:stop] epoch=6/micro_step=410/global_step=5930, RunningAvgSamplesPerSec=85.10535662323053, CurrSamplesPerSec=85.40343210274038, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:13,914] [INFO] [logging.py:96:log_dist] [Rank 0] step=5940, skipped=114, lr=[6.376595483266332e-06, 6.376595483266332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:13,948] [INFO] [timer.py:215:stop] epoch=6/micro_step=420/global_step=5940, RunningAvgSamplesPerSec=85.10504600191048, CurrSamplesPerSec=84.79511007851356, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:21,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=5950, skipped=114, lr=[6.366841244815997e-06, 6.366841244815997e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:21,476] [INFO] [timer.py:215:stop] epoch=6/micro_step=430/global_step=5950, RunningAvgSamplesPerSec=85.10500657069785, CurrSamplesPerSec=85.15904809581728, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:28,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=5960, skipped=114, lr=[6.35707998335028e-06, 6.35707998335028e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:29,013] [INFO] [timer.py:215:stop] epoch=6/micro_step=440/global_step=5960, RunningAvgSamplesPerSec=85.10479175508324, CurrSamplesPerSec=85.11525061291374, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:17:36,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=5970, skipped=114, lr=[6.347311743331277e-06, 6.347311743331277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:36,554] [INFO] [timer.py:215:stop] epoch=6/micro_step=450/global_step=5970, RunningAvgSamplesPerSec=85.10450156257227, CurrSamplesPerSec=85.03950805250084, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:44,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=5980, skipped=114, lr=[6.337536569252866e-06, 6.337536569252866e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:44,093] [INFO] [timer.py:215:stop] epoch=6/micro_step=460/global_step=5980, RunningAvgSamplesPerSec=85.10427371199584, CurrSamplesPerSec=85.10777552143028, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:51,590] [INFO] [logging.py:96:log_dist] [Rank 0] step=5990, skipped=114, lr=[6.327754505640514e-06, 6.327754505640514e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:51,624] [INFO] [timer.py:215:stop] epoch=6/micro_step=470/global_step=5990, RunningAvgSamplesPerSec=85.1041700081011, CurrSamplesPerSec=84.75944663559862, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:17:59,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=6000, skipped=114, lr=[6.317965597051064e-06, 6.317965597051064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:17:59,158] [INFO] [timer.py:215:stop] epoch=6/micro_step=480/global_step=6000, RunningAvgSamplesPerSec=85.10401628663499, CurrSamplesPerSec=85.0154033644424, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:06,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=6010, skipped=114, lr=[6.308169888072543e-06, 6.308169888072543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:06,703] [INFO] [timer.py:215:stop] epoch=6/micro_step=490/global_step=6010, RunningAvgSamplesPerSec=85.10365949862666, CurrSamplesPerSec=84.587388753443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:11,921] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:18:12,618] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:18:14,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=6020, skipped=116, lr=[6.3003284545925255e-06, 6.3003284545925255e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:14,132] [INFO] [timer.py:215:stop] epoch=6/micro_step=500/global_step=6020, RunningAvgSamplesPerSec=85.10548856591969, CurrSamplesPerSec=84.28257684745675, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:21,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=6030, skipped=116, lr=[6.290520617374243e-06, 6.290520617374243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:21,661] [INFO] [timer.py:215:stop] epoch=6/micro_step=510/global_step=6030, RunningAvgSamplesPerSec=85.10542662788366, CurrSamplesPerSec=84.98827164793414, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:29,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=6040, skipped=116, lr=[6.280706104777497e-06, 6.280706104777497e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:29,200] [INFO] [timer.py:215:stop] epoch=6/micro_step=520/global_step=6040, RunningAvgSamplesPerSec=85.10518202301186, CurrSamplesPerSec=84.45629193350732, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:36,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=6050, skipped=116, lr=[6.2708849615069386e-06, 6.2708849615069386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:36,741] [INFO] [timer.py:215:stop] epoch=6/micro_step=530/global_step=6050, RunningAvgSamplesPerSec=85.10488981481537, CurrSamplesPerSec=84.93940829353511, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:44,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=6060, skipped=116, lr=[6.261057232297421e-06, 
6.261057232297421e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:44,280] [INFO] [timer.py:215:stop] epoch=6/micro_step=540/global_step=6060, RunningAvgSamplesPerSec=85.10465703922435, CurrSamplesPerSec=85.30738094216612, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:51,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=6070, skipped=116, lr=[6.251222961913795e-06, 6.251222961913795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:51,816] [INFO] [timer.py:215:stop] epoch=6/micro_step=550/global_step=6070, RunningAvgSamplesPerSec=85.10446795426301, CurrSamplesPerSec=85.09185822221447, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:18:59,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=6080, skipped=116, lr=[6.241382195150706e-06, 6.241382195150706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:18:59,339] [INFO] [timer.py:215:stop] epoch=6/micro_step=560/global_step=6080, RunningAvgSamplesPerSec=85.10452111285122, CurrSamplesPerSec=84.76075804757721, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:06,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=6090, skipped=116, lr=[6.23153497683239e-06, 6.23153497683239e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:06,874] [INFO] [timer.py:215:stop] epoch=6/micro_step=570/global_step=6090, RunningAvgSamplesPerSec=85.10435121208954, CurrSamplesPerSec=85.00829576446058, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:14,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=6100, skipped=116, lr=[6.22168135181247e-06, 6.22168135181247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:14,406] [INFO] [timer.py:215:stop] epoch=6/micro_step=580/global_step=6100, RunningAvgSamplesPerSec=85.10423658836224, CurrSamplesPerSec=85.04748311393622, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:21,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=6110, skipped=116, lr=[6.21182136497375e-06, 6.21182136497375e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 18:19:21,949] [INFO] [timer.py:215:stop] epoch=6/micro_step=590/global_step=6110, RunningAvgSamplesPerSec=85.10391787907022, CurrSamplesPerSec=85.03910395069144, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:28,685] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:19:29,381] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:19:29,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=6120, skipped=118, lr=[6.2039288251729886e-06, 6.2039288251729886e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:29,383] [INFO] [timer.py:215:stop] epoch=6/micro_step=600/global_step=6120, RunningAvgSamplesPerSec=85.10563526786422, CurrSamplesPerSec=91.9231194120163, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:36,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=6130, skipped=118, lr=[6.194057500257468e-06, 6.194057500257468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:36,915] [INFO] [timer.py:215:stop] epoch=6/micro_step=610/global_step=6130, RunningAvgSamplesPerSec=85.10551254142906, CurrSamplesPerSec=85.12626324768978, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:44,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=6140, skipped=118, lr=[6.18417993934851e-06, 6.18417993934851e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:44,458] [INFO] [timer.py:215:stop] epoch=6/micro_step=620/global_step=6140, RunningAvgSamplesPerSec=85.10519178497096, CurrSamplesPerSec=85.04982741981374, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:51,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=6150, skipped=118, lr=[6.1742961874379475e-06, 6.1742961874379475e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:51,995] [INFO] [timer.py:215:stop] 
epoch=6/micro_step=630/global_step=6150, RunningAvgSamplesPerSec=85.10498959053827, CurrSamplesPerSec=84.91981957870907, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:19:59,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=6160, skipped=118, lr=[6.1644062895458145e-06, 6.1644062895458145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:19:59,520] [INFO] [timer.py:215:stop] epoch=6/micro_step=640/global_step=6160, RunningAvgSamplesPerSec=85.1050057738545, CurrSamplesPerSec=85.10866598690184, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:07,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=6170, skipped=118, lr=[6.154510290720134e-06, 6.154510290720134e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:07,049] [INFO] [timer.py:215:stop] epoch=6/micro_step=650/global_step=6170, RunningAvgSamplesPerSec=85.10493488332091, CurrSamplesPerSec=84.86368310952071, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:14,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=6180, skipped=118, lr=[6.144608236036723e-06, 6.144608236036723e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:14,586] [INFO] [timer.py:215:stop] epoch=6/micro_step=660/global_step=6180, RunningAvgSamplesPerSec=85.10473797036374, CurrSamplesPerSec=84.77349959039975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:22,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=6190, skipped=118, lr=[6.134700170598984e-06, 6.134700170598984e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:22,119] [INFO] [timer.py:215:stop] epoch=6/micro_step=670/global_step=6190, RunningAvgSamplesPerSec=85.10459986976598, CurrSamplesPerSec=85.01755741727719, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:29,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=6200, skipped=118, lr=[6.124786139537692e-06, 6.124786139537692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:29,648] [INFO] [timer.py:215:stop] epoch=6/micro_step=680/global_step=6200, 
RunningAvgSamplesPerSec=85.10454661170833, CurrSamplesPerSec=85.22826537668521, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:37,145] [INFO] [logging.py:96:log_dist] [Rank 0] step=6210, skipped=118, lr=[6.114866188010802e-06, 6.114866188010802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:37,178] [INFO] [timer.py:215:stop] epoch=6/micro_step=690/global_step=6210, RunningAvgSamplesPerSec=85.10447269813362, CurrSamplesPerSec=85.37154488092816, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:44,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=6220, skipped=118, lr=[6.104940361203231e-06, 6.104940361203231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:44,721] [INFO] [timer.py:215:stop] epoch=6/micro_step=700/global_step=6220, RunningAvgSamplesPerSec=85.10416517288198, CurrSamplesPerSec=84.73924531422932, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:45,416] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:20:46,115] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:20:52,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=6230, skipped=120, lr=[6.096995499936438e-06, 6.096995499936438e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:52,154] [INFO] [timer.py:215:stop] epoch=6/micro_step=710/global_step=6230, RunningAvgSamplesPerSec=85.10585828328651, CurrSamplesPerSec=84.79095851035267, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:20:59,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=6240, skipped=120, lr=[6.0870592115749305e-06, 6.0870592115749305e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:20:59,687] [INFO] [timer.py:215:stop] epoch=6/micro_step=720/global_step=6240, RunningAvgSamplesPerSec=85.10573084005071, CurrSamplesPerSec=85.0006241210928, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:07,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=6250, skipped=120, lr=[6.077117174592231e-06, 6.077117174592231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:07,212] [INFO] [timer.py:215:stop] epoch=6/micro_step=730/global_step=6250, RunningAvgSamplesPerSec=85.1057447447515, CurrSamplesPerSec=84.97116173536055, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:14,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=6260, skipped=120, lr=[6.067169434273856e-06, 6.067169434273856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:14,741] [INFO] [timer.py:215:stop] epoch=6/micro_step=740/global_step=6260, RunningAvgSamplesPerSec=85.10569157632146, CurrSamplesPerSec=85.126695173789, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:22,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=6270, skipped=120, lr=[6.057216035931302e-06, 6.057216035931302e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:22,286] [INFO] [timer.py:215:stop] epoch=6/micro_step=750/global_step=6270, RunningAvgSamplesPerSec=85.10534854184826, CurrSamplesPerSec=84.48845629071018, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:21:29,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=6280, skipped=120, lr=[6.047257024901837e-06, 6.047257024901837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:29,815] [INFO] [timer.py:215:stop] epoch=6/micro_step=760/global_step=6280, RunningAvgSamplesPerSec=85.10528859296585, CurrSamplesPerSec=85.18555908332897, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:37,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=6290, skipped=120, lr=[6.037292446548297e-06, 6.037292446548297e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:37,353] [INFO] [timer.py:215:stop] epoch=6/micro_step=770/global_step=6290, RunningAvgSamplesPerSec=85.1050792760557, CurrSamplesPerSec=84.81399814849335, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:44,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=6300, skipped=120, lr=[6.0273223462588705e-06, 6.0273223462588705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:44,889] [INFO] [timer.py:215:stop] epoch=6/micro_step=780/global_step=6300, RunningAvgSamplesPerSec=85.10490182218815, CurrSamplesPerSec=85.14162631378792, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:52,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=6310, skipped=120, lr=[6.0173467694469044e-06, 6.0173467694469044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:52,439] [INFO] [timer.py:215:stop] epoch=6/micro_step=790/global_step=6310, RunningAvgSamplesPerSec=85.10447316275908, CurrSamplesPerSec=84.88576909187387, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:21:59,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=6320, skipped=120, lr=[6.007365761550688e-06, 6.007365761550688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:21:59,971] [INFO] [timer.py:215:stop] epoch=6/micro_step=800/global_step=6320, RunningAvgSamplesPerSec=85.1043696482061, CurrSamplesPerSec=84.58456346549238, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:02,174] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:22:02,873] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:22:07,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=6330, skipped=122, lr=[5.999377075403383e-06, 5.999377075403383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:07,399] [INFO] [timer.py:215:stop] epoch=6/micro_step=810/global_step=6330, RunningAvgSamplesPerSec=85.1061148989863, CurrSamplesPerSec=84.6646775408349, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:14,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=6340, skipped=122, lr=[5.989386406138838e-06, 5.989386406138838e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:14,931] [INFO] [timer.py:215:stop] epoch=6/micro_step=820/global_step=6340, RunningAvgSamplesPerSec=85.10601349319404, CurrSamplesPerSec=85.19093896323245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:22,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=6350, skipped=122, lr=[5.979390433148203e-06, 5.979390433148203e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:22,467] [INFO] [timer.py:215:stop] epoch=6/micro_step=830/global_step=6350, RunningAvgSamplesPerSec=85.10583385323449, CurrSamplesPerSec=84.91431271986944, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:29,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=6360, skipped=122, lr=[5.969389201962667e-06, 5.969389201962667e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:30,005] [INFO] [timer.py:215:stop] epoch=6/micro_step=840/global_step=6360, RunningAvgSamplesPerSec=85.10561665269702, CurrSamplesPerSec=85.07427518146412, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:37,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=6370, skipped=122, lr=[5.959382758137377e-06, 
5.959382758137377e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:37,547] [INFO] [timer.py:215:stop] epoch=6/micro_step=850/global_step=6370, RunningAvgSamplesPerSec=85.10533243469199, CurrSamplesPerSec=84.98302495176014, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:45,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=6380, skipped=122, lr=[5.949371147251223e-06, 5.949371147251223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:45,088] [INFO] [timer.py:215:stop] epoch=6/micro_step=860/global_step=6380, RunningAvgSamplesPerSec=85.1050662389047, CurrSamplesPerSec=84.71290441992106, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:22:52,600] [INFO] [logging.py:96:log_dist] [Rank 0] step=6390, skipped=122, lr=[5.939354414906624e-06, 5.939354414906624e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:22:52,633] [INFO] [timer.py:215:stop] epoch=6/micro_step=870/global_step=6390, RunningAvgSamplesPerSec=85.10472660696702, CurrSamplesPerSec=84.78177293677297, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:23:00,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=6400, skipped=122, lr=[5.9293326067293335e-06, 5.9293326067293335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:23:00,190] [INFO] [timer.py:215:stop] epoch=6/micro_step=880/global_step=6400, RunningAvgSamplesPerSec=85.10417413308107, CurrSamplesPerSec=84.96424977835932, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:23:07,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=6410, skipped=122, lr=[5.919305768368224e-06, 5.919305768368224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:23:07,723] [INFO] [timer.py:215:stop] epoch=6/micro_step=890/global_step=6410, RunningAvgSamplesPerSec=85.10404960491547, CurrSamplesPerSec=84.94766029221427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:23:15,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=6420, skipped=122, lr=[5.909273945495077e-06, 5.909273945495077e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 18:23:15,255] [INFO] [timer.py:215:stop] epoch=6/micro_step=900/global_step=6420, RunningAvgSamplesPerSec=85.10393520997496, CurrSamplesPerSec=85.05853212294087, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:23:18,960] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:23:19,658] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:23:22,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=6430, skipped=124, lr=[5.901244929053832e-06, 5.901244929053832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:23:22,668] [INFO] [timer.py:215:stop] epoch=6/micro_step=910/global_step=6430, RunningAvgSamplesPerSec=85.10594874528421, CurrSamplesPerSec=85.14640646519703, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:23:30,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=6440, skipped=124, lr=[5.8912042492242554e-06, 5.8912042492242554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:23:30,209] [INFO] [timer.py:215:stop] epoch=6/micro_step=920/global_step=6440, RunningAvgSamplesPerSec=85.10570287880842, CurrSamplesPerSec=84.97557306617469, MemAllocated=5.0GB, MaxMemAllocated=23.99GB ***** Evaluating perplexity, Epoch 7/16 ***** ppl: 1.8075904846191406 Beginning of Epoch 8/16, Total Micro Batches 920 [2023-06-29 18:23:55,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=6450, skipped=124, lr=[5.881158712883758e-06, 5.881158712883758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:23:55,641] [INFO] [timer.py:215:stop] epoch=7/micro_step=10/global_step=6450, RunningAvgSamplesPerSec=85.1052690949166, CurrSamplesPerSec=84.73167564428128, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:03,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=6460, skipped=124, lr=[5.8711083657892926e-06, 
5.8711083657892926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:03,191] [INFO] [timer.py:215:stop] epoch=7/micro_step=20/global_step=6460, RunningAvgSamplesPerSec=85.10484407063535, CurrSamplesPerSec=85.03775697239679, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:10,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=6470, skipped=124, lr=[5.861053253719727e-06, 5.861053253719727e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:10,748] [INFO] [timer.py:215:stop] epoch=7/micro_step=30/global_step=6470, RunningAvgSamplesPerSec=85.10430928155148, CurrSamplesPerSec=84.565777436143, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:18,271] [INFO] [logging.py:96:log_dist] [Rank 0] step=6480, skipped=124, lr=[5.850993422475626e-06, 5.850993422475626e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:18,305] [INFO] [timer.py:215:stop] epoch=7/micro_step=40/global_step=6480, RunningAvgSamplesPerSec=85.1037719852334, CurrSamplesPerSec=84.76231038359036, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:25,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=6490, skipped=124, lr=[5.840928917879057e-06, 5.840928917879057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:25,854] [INFO] [timer.py:215:stop] epoch=7/micro_step=50/global_step=6490, RunningAvgSamplesPerSec=85.10337508539988, CurrSamplesPerSec=85.10299970642708, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:33,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=6500, skipped=124, lr=[5.830859785773373e-06, 5.830859785773373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:33,384] [INFO] [timer.py:215:stop] epoch=7/micro_step=60/global_step=6500, RunningAvgSamplesPerSec=85.103307795027, CurrSamplesPerSec=84.77722105772459, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:40,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=6510, skipped=124, lr=[5.8207860720230026e-06, 5.8207860720230026e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 18:24:40,913] [INFO] [timer.py:215:stop] epoch=7/micro_step=70/global_step=6510, RunningAvgSamplesPerSec=85.10326316088106, CurrSamplesPerSec=85.20091651889138, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:48,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=6520, skipped=124, lr=[5.810707822513246e-06, 5.810707822513246e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:48,443] [INFO] [timer.py:215:stop] epoch=7/micro_step=80/global_step=6520, RunningAvgSamplesPerSec=85.10319921682623, CurrSamplesPerSec=84.95763471713066, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:24:53,664] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:24:54,358] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:24:55,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=6530, skipped=126, lr=[5.802641988006797e-06, 5.802641988006797e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:24:55,864] [INFO] [timer.py:215:stop] epoch=7/micro_step=90/global_step=6530, RunningAvgSamplesPerSec=85.10503161988548, CurrSamplesPerSec=85.00328885077687, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:03,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=6540, skipped=126, lr=[5.792555689826908e-06, 5.792555689826908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:03,416] [INFO] [timer.py:215:stop] epoch=7/micro_step=100/global_step=6540, RunningAvgSamplesPerSec=85.1045888036932, CurrSamplesPerSec=84.59136046790113, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:10,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=6550, skipped=126, lr=[5.782464984475714e-06, 5.782464984475714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:10,951] [INFO] [timer.py:215:stop] 
epoch=7/micro_step=110/global_step=6550, RunningAvgSamplesPerSec=85.10443904192957, CurrSamplesPerSec=85.1726933829495, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:18,451] [INFO] [logging.py:96:log_dist] [Rank 0] step=6560, skipped=126, lr=[5.7723699179159095e-06, 5.7723699179159095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:18,485] [INFO] [timer.py:215:stop] epoch=7/micro_step=120/global_step=6560, RunningAvgSamplesPerSec=85.1042932797976, CurrSamplesPerSec=84.98929415645969, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:25,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=6570, skipped=126, lr=[5.762270536130056e-06, 5.762270536130056e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:26,022] [INFO] [timer.py:215:stop] epoch=7/micro_step=130/global_step=6570, RunningAvgSamplesPerSec=85.10410913750677, CurrSamplesPerSec=85.13390358748875, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:33,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=6580, skipped=126, lr=[5.752166885120367e-06, 5.752166885120367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:33,555] [INFO] [timer.py:215:stop] epoch=7/micro_step=140/global_step=6580, RunningAvgSamplesPerSec=85.10399534661349, CurrSamplesPerSec=85.05877469436436, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:41,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=6590, skipped=126, lr=[5.742059010908505e-06, 5.742059010908505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:41,078] [INFO] [timer.py:215:stop] epoch=7/micro_step=150/global_step=6590, RunningAvgSamplesPerSec=85.10404201063878, CurrSamplesPerSec=85.20892188446828, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:48,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=6600, skipped=126, lr=[5.73194695953537e-06, 5.73194695953537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:48,608] [INFO] [timer.py:215:stop] epoch=7/micro_step=160/global_step=6600, 
RunningAvgSamplesPerSec=85.10398125478231, CurrSamplesPerSec=84.55557519018554, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:25:56,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=6610, skipped=126, lr=[5.721830777060886e-06, 5.721830777060886e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:25:56,151] [INFO] [timer.py:215:stop] epoch=7/micro_step=170/global_step=6610, RunningAvgSamplesPerSec=85.10367985869803, CurrSamplesPerSec=85.10192050044986, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:03,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=6620, skipped=126, lr=[5.711710509563793e-06, 5.711710509563793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:03,695] [INFO] [timer.py:215:stop] epoch=7/micro_step=180/global_step=6620, RunningAvgSamplesPerSec=85.10337407156739, CurrSamplesPerSec=84.87425503747855, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:10,430] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:26:11,126] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:26:11,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=6630, skipped=128, lr=[5.703611385326642e-06, 5.703611385326642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:11,128] [INFO] [timer.py:215:stop] epoch=7/micro_step=190/global_step=6630, RunningAvgSamplesPerSec=85.10498838365257, CurrSamplesPerSec=91.9851732906776, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:18,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=6640, skipped=128, lr=[5.693483880966548e-06, 5.693483880966548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:18,673] [INFO] [timer.py:215:stop] epoch=7/micro_step=200/global_step=6640, RunningAvgSamplesPerSec=85.1046597056243, CurrSamplesPerSec=84.7141341885259, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:26,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=6650, skipped=128, lr=[5.683352420702643e-06, 5.683352420702643e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:26,221] [INFO] [timer.py:215:stop] epoch=7/micro_step=210/global_step=6650, RunningAvgSamplesPerSec=85.10430734691998, CurrSamplesPerSec=84.5144715064226, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:33,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=6660, skipped=128, lr=[5.673217050683262e-06, 5.673217050683262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:33,757] [INFO] [timer.py:215:stop] epoch=7/micro_step=220/global_step=6660, RunningAvgSamplesPerSec=85.10413259417602, CurrSamplesPerSec=85.23381303301802, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:41,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=6670, skipped=128, lr=[5.663077817074542e-06, 5.663077817074542e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:41,290] [INFO] [timer.py:215:stop] epoch=7/micro_step=230/global_step=6670, RunningAvgSamplesPerSec=85.10402152995839, CurrSamplesPerSec=84.70402975735375, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:26:48,803] [INFO] [logging.py:96:log_dist] [Rank 0] step=6680, skipped=128, lr=[5.652934766060224e-06, 5.652934766060224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:48,837] [INFO] [timer.py:215:stop] epoch=7/micro_step=240/global_step=6680, RunningAvgSamplesPerSec=85.10366217598084, CurrSamplesPerSec=84.80748684220087, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:26:56,359] [INFO] [logging.py:96:log_dist] [Rank 0] step=6690, skipped=128, lr=[5.642787943841435e-06, 5.642787943841435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:26:56,392] [INFO] [timer.py:215:stop] epoch=7/micro_step=250/global_step=6690, RunningAvgSamplesPerSec=85.10317291766275, CurrSamplesPerSec=84.94645061557473, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:03,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=6700, skipped=128, lr=[5.632637396636479e-06, 5.632637396636479e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:03,934] [INFO] [timer.py:215:stop] epoch=7/micro_step=260/global_step=6700, RunningAvgSamplesPerSec=85.1029185918196, CurrSamplesPerSec=85.04427674390607, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:11,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=6710, skipped=128, lr=[5.622483170680628e-06, 5.622483170680628e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:11,464] [INFO] [timer.py:215:stop] epoch=7/micro_step=270/global_step=6710, RunningAvgSamplesPerSec=85.1028608504356, CurrSamplesPerSec=84.93725819889083, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:18,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=6720, skipped=128, lr=[5.612325312225912e-06, 5.612325312225912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:18,995] [INFO] [timer.py:215:stop] epoch=7/micro_step=280/global_step=6720, RunningAvgSamplesPerSec=85.10277913204222, CurrSamplesPerSec=84.73111399122372, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:26,500] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=6730, skipped=128, lr=[5.602163867540904e-06, 5.602163867540904e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:26,534] [INFO] [timer.py:215:stop] epoch=7/micro_step=290/global_step=6730, RunningAvgSamplesPerSec=85.1025662923816, CurrSamplesPerSec=85.1393039480833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:27,230] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:27:27,928] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:27:33,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=6740, skipped=130, lr=[5.594032160810001e-06, 5.594032160810001e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:33,956] [INFO] [timer.py:215:stop] epoch=7/micro_step=300/global_step=6740, RunningAvgSamplesPerSec=85.10432496252143, CurrSamplesPerSec=85.10343139648205, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:41,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=6750, skipped=130, lr=[5.5838643775592805e-06, 5.5838643775592805e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:41,500] [INFO] [timer.py:215:stop] epoch=7/micro_step=310/global_step=6750, RunningAvgSamplesPerSec=85.1040239098883, CurrSamplesPerSec=85.14073515847913, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:49,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=6760, skipped=130, lr=[5.5736931377165065e-06, 5.5736931377165065e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:49,034] [INFO] [timer.py:215:stop] epoch=7/micro_step=320/global_step=6760, RunningAvgSamplesPerSec=85.10389384883308, CurrSamplesPerSec=85.22031048604718, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:27:56,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=6770, skipped=130, lr=[5.563518487611204e-06, 
5.563518487611204e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:27:56,570] [INFO] [timer.py:215:stop] epoch=7/micro_step=330/global_step=6770, RunningAvgSamplesPerSec=85.10373684073083, CurrSamplesPerSec=85.03883455161879, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:04,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=6780, skipped=130, lr=[5.553340473588432e-06, 5.553340473588432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:04,103] [INFO] [timer.py:215:stop] epoch=7/micro_step=340/global_step=6780, RunningAvgSamplesPerSec=85.10362563614393, CurrSamplesPerSec=85.10291876502934, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:11,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=6790, skipped=130, lr=[5.543159142008574e-06, 5.543159142008574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:11,629] [INFO] [timer.py:215:stop] epoch=7/micro_step=350/global_step=6790, RunningAvgSamplesPerSec=85.10362407010875, CurrSamplesPerSec=85.204540383173, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:19,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=6800, skipped=130, lr=[5.5329745392471205e-06, 5.5329745392471205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:19,162] [INFO] [timer.py:215:stop] epoch=7/micro_step=360/global_step=6800, RunningAvgSamplesPerSec=85.10351256111527, CurrSamplesPerSec=84.94400448586869, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:26,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=6810, skipped=130, lr=[5.522786711694468e-06, 5.522786711694468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:26,699] [INFO] [timer.py:215:stop] epoch=7/micro_step=370/global_step=6810, RunningAvgSamplesPerSec=85.10332793203094, CurrSamplesPerSec=85.02450498138514, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:34,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=6820, skipped=130, lr=[5.512595705755698e-06, 5.512595705755698e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 18:28:34,236] [INFO] [timer.py:215:stop] epoch=7/micro_step=380/global_step=6820, RunningAvgSamplesPerSec=85.10314388490984, CurrSamplesPerSec=84.99349208088134, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:41,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=6830, skipped=130, lr=[5.50240156785037e-06, 5.50240156785037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:41,781] [INFO] [timer.py:215:stop] epoch=7/micro_step=390/global_step=6830, RunningAvgSamplesPerSec=85.10283503802495, CurrSamplesPerSec=84.85534012254982, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:43,991] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:28:44,687] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:28:49,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=6840, skipped=132, lr=[5.49424403371324e-06, 5.49424403371324e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:49,206] [INFO] [timer.py:215:stop] epoch=7/micro_step=400/global_step=6840, RunningAvgSamplesPerSec=85.10451716320485, CurrSamplesPerSec=84.85483047555104, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:28:56,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=6850, skipped=132, lr=[5.4840443752907975e-06, 5.4840443752907975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:28:56,732] [INFO] [timer.py:215:stop] epoch=7/micro_step=410/global_step=6850, RunningAvgSamplesPerSec=85.10450161019085, CurrSamplesPerSec=85.10631843630503, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:04,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=6860, skipped=132, lr=[5.473841714951782e-06, 5.473841714951782e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:04,269] [INFO] [timer.py:215:stop] 
epoch=7/micro_step=420/global_step=6860, RunningAvgSamplesPerSec=85.10432997403808, CurrSamplesPerSec=84.68139709635432, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:11,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=6870, skipped=132, lr=[5.463636099168839e-06, 5.463636099168839e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:11,792] [INFO] [timer.py:215:stop] epoch=7/micro_step=430/global_step=6870, RunningAvgSamplesPerSec=85.10438351229953, CurrSamplesPerSec=84.97027414487798, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:19,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=6880, skipped=132, lr=[5.4534275744280765e-06, 5.4534275744280765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:19,315] [INFO] [timer.py:215:stop] epoch=7/micro_step=440/global_step=6880, RunningAvgSamplesPerSec=85.10442910023437, CurrSamplesPerSec=85.39452083904622, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:26,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=6890, skipped=132, lr=[5.44321618722885e-06, 5.44321618722885e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:26,847] [INFO] [timer.py:215:stop] epoch=7/micro_step=450/global_step=6890, RunningAvgSamplesPerSec=85.10433649953896, CurrSamplesPerSec=84.86212705824333, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:34,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=6900, skipped=132, lr=[5.433001984083553e-06, 5.433001984083553e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:34,392] [INFO] [timer.py:215:stop] epoch=7/micro_step=460/global_step=6900, RunningAvgSamplesPerSec=85.1040197243895, CurrSamplesPerSec=85.09447472514374, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:41,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=6910, skipped=132, lr=[5.42278501151741e-06, 5.42278501151741e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:41,918] [INFO] [timer.py:215:stop] epoch=7/micro_step=470/global_step=6910, 
RunningAvgSamplesPerSec=85.1040195054585, CurrSamplesPerSec=84.91415155427097, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:49,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=6920, skipped=132, lr=[5.412565316068258e-06, 5.412565316068258e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:49,455] [INFO] [timer.py:215:stop] epoch=7/micro_step=480/global_step=6920, RunningAvgSamplesPerSec=85.10384222137273, CurrSamplesPerSec=84.72006966074473, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:29:56,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=6930, skipped=132, lr=[5.402342944286334e-06, 5.402342944286334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:29:56,994] [INFO] [timer.py:215:stop] epoch=7/micro_step=490/global_step=6930, RunningAvgSamplesPerSec=85.10363043071948, CurrSamplesPerSec=84.99677536204351, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:00,701] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:30:01,399] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:30:04,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=6940, skipped=134, lr=[5.3941631511907465e-06, 5.3941631511907465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:04,414] [INFO] [timer.py:215:stop] epoch=7/micro_step=500/global_step=6940, RunningAvgSamplesPerSec=85.10536992190309, CurrSamplesPerSec=84.87581153355455, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:11,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=6950, skipped=134, lr=[5.383936079355214e-06, 5.383936079355214e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:11,938] [INFO] [timer.py:215:stop] epoch=7/micro_step=510/global_step=6950, RunningAvgSamplesPerSec=85.10539477397437, CurrSamplesPerSec=85.1613175037959, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:19,427] [INFO] [logging.py:96:log_dist] [Rank 0] step=6960, skipped=134, lr=[5.373706461591753e-06, 5.373706461591753e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:19,461] [INFO] [timer.py:215:stop] epoch=7/micro_step=520/global_step=6960, RunningAvgSamplesPerSec=85.10545269079998, CurrSamplesPerSec=85.30350435422591, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:26,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=6970, skipped=134, lr=[5.3634743444958e-06, 5.3634743444958e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:26,991] [INFO] [timer.py:215:stop] epoch=7/micro_step=530/global_step=6970, RunningAvgSamplesPerSec=85.10539103584728, CurrSamplesPerSec=85.1323916092876, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:34,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=6980, skipped=134, lr=[5.3532397746741776e-06, 5.3532397746741776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:34,531] [INFO] [timer.py:215:stop] epoch=7/micro_step=540/global_step=6980, RunningAvgSamplesPerSec=85.10515859103536, CurrSamplesPerSec=85.0695570439001, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:30:42,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=6990, skipped=134, lr=[5.343002798744872e-06, 5.343002798744872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:42,067] [INFO] [timer.py:215:stop] epoch=7/micro_step=550/global_step=6990, RunningAvgSamplesPerSec=85.10500106658925, CurrSamplesPerSec=85.31339983975657, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:49,573] [INFO] [logging.py:96:log_dist] [Rank 0] step=7000, skipped=134, lr=[5.332763463336836e-06, 5.332763463336836e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:49,607] [INFO] [timer.py:215:stop] epoch=7/micro_step=560/global_step=7000, RunningAvgSamplesPerSec=85.10477387073533, CurrSamplesPerSec=85.04888429271581, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:30:57,101] [INFO] [logging.py:96:log_dist] [Rank 0] step=7010, skipped=134, lr=[5.322521815089769e-06, 5.322521815089769e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:30:57,135] [INFO] [timer.py:215:stop] epoch=7/micro_step=570/global_step=7010, RunningAvgSamplesPerSec=85.10474917394178, CurrSamplesPerSec=84.9655944289683, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:04,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=7020, skipped=134, lr=[5.312277900653901e-06, 5.312277900653901e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:04,664] [INFO] [timer.py:215:stop] epoch=7/micro_step=580/global_step=7020, RunningAvgSamplesPerSec=85.10469637028922, CurrSamplesPerSec=84.94042962662473, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:12,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=7030, skipped=134, lr=[5.30203176668979e-06, 5.30203176668979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:12,191] [INFO] [timer.py:215:stop] epoch=7/micro_step=590/global_step=7030, RunningAvgSamplesPerSec=85.10467249139437, CurrSamplesPerSec=84.92640188357318, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:17,414] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:31:18,111] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:31:19,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=7040, skipped=136, lr=[5.293833292820517e-06, 5.293833292820517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:19,615] [INFO] [timer.py:215:stop] epoch=7/micro_step=600/global_step=7040, RunningAvgSamplesPerSec=85.10631752385011, CurrSamplesPerSec=85.20318815895597, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:27,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=7050, skipped=136, lr=[5.2835832813223e-06, 5.2835832813223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:27,141] [INFO] [timer.py:215:stop] epoch=7/micro_step=610/global_step=7050, RunningAvgSamplesPerSec=85.10631568368218, CurrSamplesPerSec=85.12348282835201, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:34,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=7060, skipped=136, lr=[5.2733311809984985e-06, 5.2733311809984985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:34,669] [INFO] [timer.py:215:stop] epoch=7/micro_step=620/global_step=7060, RunningAvgSamplesPerSec=85.10628074938059, CurrSamplesPerSec=85.08851364865812, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:42,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=7070, skipped=136, lr=[5.263077038546956e-06, 5.263077038546956e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:42,199] [INFO] [timer.py:215:stop] epoch=7/micro_step=630/global_step=7070, RunningAvgSamplesPerSec=85.10621821753399, CurrSamplesPerSec=85.00363877718594, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:49,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=7080, skipped=136, lr=[5.252820900674813e-06, 
5.252820900674813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:49,750] [INFO] [timer.py:215:stop] epoch=7/micro_step=640/global_step=7080, RunningAvgSamplesPerSec=85.10580650962457, CurrSamplesPerSec=83.99558924948401, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:31:57,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=7090, skipped=136, lr=[5.2425628140983045e-06, 5.2425628140983045e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:31:57,289] [INFO] [timer.py:215:stop] epoch=7/micro_step=650/global_step=7090, RunningAvgSamplesPerSec=85.10560203309603, CurrSamplesPerSec=85.14951250095558, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:04,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=7100, skipped=136, lr=[5.232302825542539e-06, 5.232302825542539e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:04,822] [INFO] [timer.py:215:stop] epoch=7/micro_step=660/global_step=7100, RunningAvgSamplesPerSec=85.10548656089252, CurrSamplesPerSec=84.72678148932786, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:12,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=7110, skipped=136, lr=[5.222040981741288e-06, 5.222040981741288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:12,359] [INFO] [timer.py:215:stop] epoch=7/micro_step=670/global_step=7110, RunningAvgSamplesPerSec=85.10531619347647, CurrSamplesPerSec=85.08026122909526, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:19,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=7120, skipped=136, lr=[5.211777329436774e-06, 5.211777329436774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:19,882] [INFO] [timer.py:215:stop] epoch=7/micro_step=680/global_step=7120, RunningAvgSamplesPerSec=85.10535785839286, CurrSamplesPerSec=85.09695650547509, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:27,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=7130, skipped=136, lr=[5.201511915379459e-06, 5.201511915379459e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 18:32:27,425] [INFO] [timer.py:215:stop] epoch=7/micro_step=690/global_step=7130, RunningAvgSamplesPerSec=85.10510665563741, CurrSamplesPerSec=85.08387483672487, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:34,150] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:32:34,848] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:32:34,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=7140, skipped=138, lr=[5.193298347093025e-06, 5.193298347093025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:34,849] [INFO] [timer.py:215:stop] epoch=7/micro_step=700/global_step=7140, RunningAvgSamplesPerSec=85.1067278040967, CurrSamplesPerSec=91.85998609278413, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:42,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=7150, skipped=138, lr=[5.1830298797173054e-06, 5.1830298797173054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:42,390] [INFO] [timer.py:215:stop] epoch=7/micro_step=710/global_step=7150, RunningAvgSamplesPerSec=85.10648687779606, CurrSamplesPerSec=84.76134685799234, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:49,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=7160, skipped=138, lr=[5.172759781532084e-06, 5.172759781532084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:49,925] [INFO] [timer.py:215:stop] epoch=7/micro_step=720/global_step=7160, RunningAvgSamplesPerSec=85.10632886671186, CurrSamplesPerSec=85.01884990137343, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:32:57,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=7170, skipped=138, lr=[5.16248809931718e-06, 5.16248809931718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:32:57,458] [INFO] [timer.py:215:stop] 
epoch=7/micro_step=730/global_step=7170, RunningAvgSamplesPerSec=85.10622634625032, CurrSamplesPerSec=85.04055873517927, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:04,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=7180, skipped=138, lr=[5.1522148798596316e-06, 5.1522148798596316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:04,986] [INFO] [timer.py:215:stop] epoch=7/micro_step=740/global_step=7180, RunningAvgSamplesPerSec=85.1062076679686, CurrSamplesPerSec=85.21606307441421, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:12,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=7190, skipped=138, lr=[5.141940169953478e-06, 5.141940169953478e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:12,515] [INFO] [timer.py:215:stop] epoch=7/micro_step=750/global_step=7190, RunningAvgSamplesPerSec=85.10614865049835, CurrSamplesPerSec=84.83906126789832, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:20,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=7200, skipped=138, lr=[5.1316640163995466e-06, 5.1316640163995466e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:20,051] [INFO] [timer.py:215:stop] epoch=7/micro_step=760/global_step=7200, RunningAvgSamplesPerSec=85.10598434149811, CurrSamplesPerSec=85.22404422937352, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:27,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=7210, skipped=138, lr=[5.121386466005237e-06, 5.121386466005237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:27,590] [INFO] [timer.py:215:stop] epoch=7/micro_step=770/global_step=7210, RunningAvgSamplesPerSec=85.10578762966352, CurrSamplesPerSec=84.84774969806219, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:35,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=7220, skipped=138, lr=[5.1111075655843175e-06, 5.1111075655843175e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:35,122] [INFO] [timer.py:215:stop] 
epoch=7/micro_step=780/global_step=7220, RunningAvgSamplesPerSec=85.10569813644089, CurrSamplesPerSec=85.1708557429906, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:42,624] [INFO] [logging.py:96:log_dist] [Rank 0] step=7230, skipped=138, lr=[5.100827361956704e-06, 5.100827361956704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:42,658] [INFO] [timer.py:215:stop] epoch=7/micro_step=790/global_step=7230, RunningAvgSamplesPerSec=85.10554299611628, CurrSamplesPerSec=84.80823706521379, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:50,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=7240, skipped=138, lr=[5.090545901948244e-06, 5.090545901948244e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:50,191] [INFO] [timer.py:215:stop] epoch=7/micro_step=800/global_step=7240, RunningAvgSamplesPerSec=85.10543325382913, CurrSamplesPerSec=84.89857510048746, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:33:50,885] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:33:51,581] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:33:57,584] [INFO] [logging.py:96:log_dist] [Rank 0] step=7250, skipped=140, lr=[5.0823198608179e-06, 5.0823198608179e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:33:57,617] [INFO] [timer.py:215:stop] epoch=7/micro_step=810/global_step=7250, RunningAvgSamplesPerSec=85.10700083937064, CurrSamplesPerSec=84.38991415937967, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:05,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=7260, skipped=140, lr=[5.072036257343196e-06, 5.072036257343196e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:05,156] [INFO] [timer.py:215:stop] epoch=7/micro_step=820/global_step=7260, RunningAvgSamplesPerSec=85.10678229932373, CurrSamplesPerSec=84.884856442112, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:12,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=7270, skipped=140, lr=[5.061751528629793e-06, 5.061751528629793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:12,689] [INFO] [timer.py:215:stop] epoch=7/micro_step=830/global_step=7270, RunningAvgSamplesPerSec=85.10667368512145, CurrSamplesPerSec=84.98367066402294, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:20,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=7280, skipped=140, lr=[5.0514657215241545e-06, 5.0514657215241545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:20,226] [INFO] [timer.py:215:stop] epoch=7/micro_step=840/global_step=7280, RunningAvgSamplesPerSec=85.10649112823522, CurrSamplesPerSec=84.99438015649692, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:27,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=7290, skipped=140, lr=[5.041178882877655e-06, 5.041178882877655e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:27,758] [INFO] [timer.py:215:stop] epoch=7/micro_step=850/global_step=7290, RunningAvgSamplesPerSec=85.10640487473898, CurrSamplesPerSec=85.18201792520966, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 18:34:35,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=7300, skipped=140, lr=[5.030891059546367e-06, 5.030891059546367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:35,298] [INFO] [timer.py:215:stop] epoch=7/micro_step=860/global_step=7300, RunningAvgSamplesPerSec=85.10618259922754, CurrSamplesPerSec=84.99612945062516, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:42,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=7310, skipped=140, lr=[5.0206022983908484e-06, 5.0206022983908484e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:42,829] [INFO] [timer.py:215:stop] epoch=7/micro_step=870/global_step=7310, RunningAvgSamplesPerSec=85.10611654730057, CurrSamplesPerSec=85.04042403082086, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:50,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=7320, skipped=140, lr=[5.0103126462759325e-06, 5.0103126462759325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:50,362] [INFO] [timer.py:215:stop] epoch=7/micro_step=880/global_step=7320, RunningAvgSamplesPerSec=85.105995405934, CurrSamplesPerSec=85.04177109360867, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:34:57,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=7330, skipped=140, lr=[5.000022150070503e-06, 5.000022150070503e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:34:57,886] [INFO] [timer.py:215:stop] epoch=7/micro_step=890/global_step=7330, RunningAvgSamplesPerSec=85.10603093712766, CurrSamplesPerSec=84.78348671200699, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:35:05,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=7340, skipped=140, lr=[4.989730856647296e-06, 4.989730856647296e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:35:05,421] [INFO] [timer.py:215:stop] epoch=7/micro_step=900/global_step=7340, RunningAvgSamplesPerSec=85.1058915780441, CurrSamplesPerSec=85.02636324330831, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:35:07,624] [INFO] 
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:35:08,321] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:35:12,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=7350, skipped=142, lr=[4.98149727941273e-06, 4.98149727941273e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:35:12,840] [INFO] [timer.py:215:stop] epoch=7/micro_step=910/global_step=7350, RunningAvgSamplesPerSec=85.10753704404159, CurrSamplesPerSec=85.28819121094824, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:35:20,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=7360, skipped=142, lr=[4.971204669128264e-06, 4.971204669128264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:35:20,368] [INFO] [timer.py:215:stop] epoch=7/micro_step=920/global_step=7360, RunningAvgSamplesPerSec=85.1075037654681, CurrSamplesPerSec=84.87946147237561, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 8/16 *****
ppl: 1.799204707145691
Beginning of Epoch 9/16, Total Micro Batches 920
[2023-06-29 18:35:45,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=7370, skipped=142, lr=[4.960911392888308e-06, 4.960911392888308e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:35:45,860] [INFO] [timer.py:215:stop] epoch=8/micro_step=10/global_step=7370, RunningAvgSamplesPerSec=85.10703727147583, CurrSamplesPerSec=84.70873416615045, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:35:53,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=7380, skipped=142, lr=[4.950617497578259e-06, 4.950617497578259e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:35:53,402] [INFO] [timer.py:215:stop] epoch=8/micro_step=20/global_step=7380, RunningAvgSamplesPerSec=85.10678115486616, CurrSamplesPerSec=85.00797271997183, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29
18:36:00,911] [INFO] [logging.py:96:log_dist] [Rank 0] step=7390, skipped=142, lr=[4.940323030086334e-06, 4.940323030086334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:00,944] [INFO] [timer.py:215:stop] epoch=8/micro_step=30/global_step=7390, RunningAvgSamplesPerSec=85.10653418074823, CurrSamplesPerSec=84.82820323214703, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:08,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=7400, skipped=142, lr=[4.930028037303352e-06, 4.930028037303352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:08,482] [INFO] [timer.py:215:stop] epoch=8/micro_step=40/global_step=7400, RunningAvgSamplesPerSec=85.10635811510983, CurrSamplesPerSec=84.99271166611521, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:16,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=7410, skipped=142, lr=[4.919732566122531e-06, 4.919732566122531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:16,034] [INFO] [timer.py:215:stop] epoch=8/micro_step=50/global_step=7410, RunningAvgSamplesPerSec=85.10595423473889, CurrSamplesPerSec=84.85204093728258, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:23,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=7420, skipped=142, lr=[4.909436663439265e-06, 4.909436663439265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:23,588] [INFO] [timer.py:215:stop] epoch=8/micro_step=60/global_step=7420, RunningAvgSamplesPerSec=85.10554040401122, CurrSamplesPerSec=85.01211864401412, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:31,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=7430, skipped=142, lr=[4.899140376150912e-06, 4.899140376150912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:31,125] [INFO] [timer.py:215:stop] epoch=8/micro_step=70/global_step=7430, RunningAvgSamplesPerSec=85.10536857089615, CurrSamplesPerSec=85.01610341964094, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:38,616] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=7440, skipped=142, lr=[4.888843751156581e-06, 4.888843751156581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:38,649] [INFO] [timer.py:215:stop] epoch=8/micro_step=80/global_step=7440, RunningAvgSamplesPerSec=85.10539344538803, CurrSamplesPerSec=85.04532754442509, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:42,359] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:36:43,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:36:46,035] [INFO] [logging.py:96:log_dist] [Rank 0] step=7450, skipped=144, lr=[4.880606239530004e-06, 4.880606239530004e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:46,069] [INFO] [timer.py:215:stop] epoch=8/micro_step=90/global_step=7450, RunningAvgSamplesPerSec=85.10701744271321, CurrSamplesPerSec=84.77237518253338, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:36:53,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=7460, skipped=144, lr=[4.8703091248554536e-06, 4.8703091248554536e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:36:53,611] [INFO] [timer.py:215:stop] epoch=8/micro_step=100/global_step=7460, RunningAvgSamplesPerSec=85.1067665530849, CurrSamplesPerSec=84.61610544976787, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:01,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=7470, skipped=144, lr=[4.860011803799938e-06, 4.860011803799938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:01,146] [INFO] [timer.py:215:stop] epoch=8/micro_step=110/global_step=7470, RunningAvgSamplesPerSec=85.10662036930042, CurrSamplesPerSec=84.99185053541909, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:08,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=7480, skipped=144, lr=[4.84971432326728e-06, 4.84971432326728e-06], mom=[(0.9, 
0.95), (0.9, 0.95)] [2023-06-29 18:37:08,680] [INFO] [timer.py:215:stop] epoch=8/micro_step=120/global_step=7480, RunningAvgSamplesPerSec=85.1064933544637, CurrSamplesPerSec=84.96825696274172, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:16,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=7490, skipped=144, lr=[4.839416730162025e-06, 4.839416730162025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:16,222] [INFO] [timer.py:215:stop] epoch=8/micro_step=130/global_step=7490, RunningAvgSamplesPerSec=85.10625681383728, CurrSamplesPerSec=84.88686966621793, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:23,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=7500, skipped=144, lr=[4.829119071389233e-06, 4.829119071389233e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:23,760] [INFO] [timer.py:215:stop] epoch=8/micro_step=140/global_step=7500, RunningAvgSamplesPerSec=85.10607544969001, CurrSamplesPerSec=85.16569456077565, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:31,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=7510, skipped=144, lr=[4.818821393854262e-06, 4.818821393854262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:31,301] [INFO] [timer.py:215:stop] epoch=8/micro_step=150/global_step=7510, RunningAvgSamplesPerSec=85.10584781039594, CurrSamplesPerSec=84.92113595805374, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:38,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=7520, skipped=144, lr=[4.808523744462554e-06, 4.808523744462554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:38,848] [INFO] [timer.py:215:stop] epoch=8/micro_step=160/global_step=7520, RunningAvgSamplesPerSec=85.10555494479769, CurrSamplesPerSec=84.4917272624834, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:46,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=7530, skipped=144, lr=[4.798226170119427e-06, 4.798226170119427e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:46,394] 
[INFO] [timer.py:215:stop] epoch=8/micro_step=170/global_step=7530, RunningAvgSamplesPerSec=85.1052480672485, CurrSamplesPerSec=84.63029768022548, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:53,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=7540, skipped=144, lr=[4.7879287177298555e-06, 4.7879287177298555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:37:53,930] [INFO] [timer.py:215:stop] epoch=8/micro_step=180/global_step=7540, RunningAvgSamplesPerSec=85.10509879964155, CurrSamplesPerSec=84.98687246561083, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:37:59,143] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:37:59,839] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:38:01,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=7550, skipped=146, lr=[4.779690875144548e-06, 4.779690875144548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:01,344] [INFO] [timer.py:215:stop] epoch=8/micro_step=190/global_step=7550, RunningAvgSamplesPerSec=85.10677915554243, CurrSamplesPerSec=85.51103007112341, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:08,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=7560, skipped=146, lr=[4.769393760469996e-06, 4.769393760469996e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:08,877] [INFO] [timer.py:215:stop] epoch=8/micro_step=200/global_step=7560, RunningAvgSamplesPerSec=85.10666958992829, CurrSamplesPerSec=84.95446200451491, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:16,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=7570, skipped=146, lr=[4.759096899079287e-06, 4.759096899079287e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:16,420] [INFO] [timer.py:215:stop] epoch=8/micro_step=210/global_step=7570, 
RunningAvgSamplesPerSec=85.10641097515789, CurrSamplesPerSec=84.7952172211375, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:23,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=7580, skipped=146, lr=[4.748800337874146e-06, 4.748800337874146e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:23,955] [INFO] [timer.py:215:stop] epoch=8/micro_step=220/global_step=7580, RunningAvgSamplesPerSec=85.10628260551218, CurrSamplesPerSec=84.6972680850849, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:31,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=7590, skipped=146, lr=[4.738504123754934e-06, 4.738504123754934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:31,494] [INFO] [timer.py:215:stop] epoch=8/micro_step=230/global_step=7590, RunningAvgSamplesPerSec=85.10607354185976, CurrSamplesPerSec=84.72236921133704, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:39,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=7600, skipped=146, lr=[4.728208303620428e-06, 4.728208303620428e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:39,039] [INFO] [timer.py:215:stop] epoch=8/micro_step=240/global_step=7600, RunningAvgSamplesPerSec=85.10579872087337, CurrSamplesPerSec=84.93621006721857, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:46,559] [INFO] [logging.py:96:log_dist] [Rank 0] step=7610, skipped=146, lr=[4.717912924367608e-06, 4.717912924367608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:46,592] [INFO] [timer.py:215:stop] epoch=8/micro_step=250/global_step=7610, RunningAvgSamplesPerSec=85.10538708233268, CurrSamplesPerSec=84.87583837019443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:38:54,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=7620, skipped=146, lr=[4.707618032891456e-06, 4.707618032891456e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:38:54,125] [INFO] [timer.py:215:stop] epoch=8/micro_step=260/global_step=7620, RunningAvgSamplesPerSec=85.1052833415441, 
CurrSamplesPerSec=85.06839781019006, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:01,616] [INFO] [logging.py:96:log_dist] [Rank 0] step=7630, skipped=146, lr=[4.697323676084721e-06, 4.697323676084721e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:01,650] [INFO] [timer.py:215:stop] epoch=8/micro_step=270/global_step=7630, RunningAvgSamplesPerSec=85.10529588319704, CurrSamplesPerSec=84.80850500522085, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:09,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=7640, skipped=146, lr=[4.68702990083772e-06, 4.68702990083772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:09,179] [INFO] [timer.py:215:stop] epoch=8/micro_step=280/global_step=7640, RunningAvgSamplesPerSec=85.10526281450596, CurrSamplesPerSec=84.64206590173042, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:15,909] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:39:16,608] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:39:16,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=7650, skipped=148, lr=[4.678795330871738e-06, 4.678795330871738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:16,610] [INFO] [timer.py:215:stop] epoch=8/micro_step=290/global_step=7650, RunningAvgSamplesPerSec=85.10667693448336, CurrSamplesPerSec=91.64099908268689, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:24,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=7660, skipped=148, lr=[4.668502720587272e-06, 4.668502720587272e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:24,143] [INFO] [timer.py:215:stop] epoch=8/micro_step=300/global_step=7660, RunningAvgSamplesPerSec=85.10655828893341, CurrSamplesPerSec=85.23419192369816, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:31,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=7670, skipped=148, lr=[4.658210823140656e-06, 4.658210823140656e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:31,680] [INFO] [timer.py:215:stop] epoch=8/micro_step=310/global_step=7670, RunningAvgSamplesPerSec=85.10639006270071, CurrSamplesPerSec=84.69823015439293, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:39,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=7680, skipped=148, lr=[4.647919685411009e-06, 4.647919685411009e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:39,214] [INFO] [timer.py:215:stop] epoch=8/micro_step=320/global_step=7680, RunningAvgSamplesPerSec=85.1062690045554, CurrSamplesPerSec=85.29350275816861, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:39:46,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=7690, skipped=148, lr=[4.6376293542739845e-06, 4.6376293542739845e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:46,765] [INFO] [timer.py:215:stop] epoch=8/micro_step=330/global_step=7690, RunningAvgSamplesPerSec=85.10589693360315, CurrSamplesPerSec=85.02076177282248, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 18:39:54,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=7700, skipped=148, lr=[4.627339876601561e-06, 4.627339876601561e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:39:54,299] [INFO] [timer.py:215:stop] epoch=8/micro_step=340/global_step=7700, RunningAvgSamplesPerSec=85.10578363536113, CurrSamplesPerSec=85.25530687062565, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:01,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=7710, skipped=148, lr=[4.617051299261837e-06, 4.617051299261837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:01,831] [INFO] [timer.py:215:stop] epoch=8/micro_step=350/global_step=7710, RunningAvgSamplesPerSec=85.10568552620823, CurrSamplesPerSec=84.96083455714792, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:09,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=7720, skipped=148, lr=[4.606763669118804e-06, 4.606763669118804e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:09,352] [INFO] [timer.py:215:stop] epoch=8/micro_step=360/global_step=7720, RunningAvgSamplesPerSec=85.10575182304356, CurrSamplesPerSec=84.962932062054, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:16,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=7730, skipped=148, lr=[4.596477033032136e-06, 4.596477033032136e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:16,884] [INFO] [timer.py:215:stop] epoch=8/micro_step=370/global_step=7730, RunningAvgSamplesPerSec=85.10566750710649, CurrSamplesPerSec=85.03412034413363, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:24,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=7740, skipped=148, lr=[4.58619143785699e-06, 4.58619143785699e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:24,422] [INFO] [timer.py:215:stop] epoch=8/micro_step=380/global_step=7740, RunningAvgSamplesPerSec=85.1054850002698, CurrSamplesPerSec=84.99612945062516, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:31,920] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=7750, skipped=148, lr=[4.5759069304437725e-06, 4.5759069304437725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:31,953] [INFO] [timer.py:215:stop] epoch=8/micro_step=390/global_step=7750, RunningAvgSamplesPerSec=85.10541670895617, CurrSamplesPerSec=84.42582485150378, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:32,652] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:40:33,350] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:40:39,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=7760, skipped=150, lr=[4.5676801391821015e-06, 4.5676801391821015e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:39,379] [INFO] [timer.py:215:stop] epoch=8/micro_step=400/global_step=7760, RunningAvgSamplesPerSec=85.10687552030342, CurrSamplesPerSec=84.89156756052469, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:46,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=7770, skipped=150, lr=[4.557397707787432e-06, 4.557397707787432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:46,907] [INFO] [timer.py:215:stop] epoch=8/micro_step=410/global_step=7770, RunningAvgSamplesPerSec=85.10684125857462, CurrSamplesPerSec=85.18466700812193, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:40:54,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=7780, skipped=150, lr=[4.547116495308796e-06, 4.547116495308796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:40:54,451] [INFO] [timer.py:215:stop] epoch=8/micro_step=420/global_step=7780, RunningAvgSamplesPerSec=85.10657835533182, CurrSamplesPerSec=84.64673673801197, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:01,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=7790, skipped=150, 
lr=[4.536836548576639e-06, 4.536836548576639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:01,983] [INFO] [timer.py:215:stop] epoch=8/micro_step=430/global_step=7790, RunningAvgSamplesPerSec=85.10649732283822, CurrSamplesPerSec=85.08719207513056, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:09,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=7800, skipped=150, lr=[4.526557914415644e-06, 4.526557914415644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:09,516] [INFO] [timer.py:215:stop] epoch=8/micro_step=440/global_step=7800, RunningAvgSamplesPerSec=85.10639969904805, CurrSamplesPerSec=85.22623592674941, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:17,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=7810, skipped=150, lr=[4.516280639644511e-06, 4.516280639644511e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:17,049] [INFO] [timer.py:215:stop] epoch=8/micro_step=450/global_step=7810, RunningAvgSamplesPerSec=85.10629397987876, CurrSamplesPerSec=85.00326193348012, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:24,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=7820, skipped=150, lr=[4.506004771075747e-06, 4.506004771075747e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:24,585] [INFO] [timer.py:215:stop] epoch=8/micro_step=460/global_step=7820, RunningAvgSamplesPerSec=85.10614758829001, CurrSamplesPerSec=85.19969961868331, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:32,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=7830, skipped=150, lr=[4.495730355515464e-06, 4.495730355515464e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:32,114] [INFO] [timer.py:215:stop] epoch=8/micro_step=470/global_step=7830, RunningAvgSamplesPerSec=85.10610154545735, CurrSamplesPerSec=85.18596457823602, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:39,612] [INFO] [logging.py:96:log_dist] [Rank 0] step=7840, skipped=150, lr=[4.485457439763144e-06, 
4.485457439763144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:39,646] [INFO] [timer.py:215:stop] epoch=8/micro_step=480/global_step=7840, RunningAvgSamplesPerSec=85.106014279023, CurrSamplesPerSec=84.95725828120999, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:47,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=7850, skipped=150, lr=[4.47518607061144e-06, 4.47518607061144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:47,185] [INFO] [timer.py:215:stop] epoch=8/micro_step=490/global_step=7850, RunningAvgSamplesPerSec=85.10582476562135, CurrSamplesPerSec=84.44760378542068, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:41:49,389] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:41:50,088] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:41:54,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=7860, skipped=152, lr=[4.466970120282696e-06, 4.466970120282696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:41:54,611] [INFO] [timer.py:215:stop] epoch=8/micro_step=500/global_step=7860, RunningAvgSamplesPerSec=85.10726868892273, CurrSamplesPerSec=84.84453155347708, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:02,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=7870, skipped=152, lr=[4.4567016529069755e-06, 4.4567016529069755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:02,142] [INFO] [timer.py:215:stop] epoch=8/micro_step=510/global_step=7870, RunningAvgSamplesPerSec=85.10719681323104, CurrSamplesPerSec=84.7389243113662, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:09,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=7880, skipped=152, lr=[4.4464348631131495e-06, 4.4464348631131495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:09,665] [INFO] 
[timer.py:215:stop] epoch=8/micro_step=520/global_step=7880, RunningAvgSamplesPerSec=85.10723714915474, CurrSamplesPerSec=85.14116723145062, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:17,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=7890, skipped=152, lr=[4.436169797665969e-06, 4.436169797665969e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:17,199] [INFO] [timer.py:215:stop] epoch=8/micro_step=530/global_step=7890, RunningAvgSamplesPerSec=85.1071172832586, CurrSamplesPerSec=84.99034359868833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:24,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=7900, skipped=152, lr=[4.425906503322332e-06, 4.425906503322332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:24,726] [INFO] [timer.py:215:stop] epoch=8/micro_step=540/global_step=7900, RunningAvgSamplesPerSec=85.1071052879173, CurrSamplesPerSec=85.27575502279015, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:32,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=7910, skipped=152, lr=[4.4156450268310666e-06, 4.4156450268310666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:32,249] [INFO] [timer.py:215:stop] epoch=8/micro_step=550/global_step=7910, RunningAvgSamplesPerSec=85.10713273639477, CurrSamplesPerSec=85.48295502674652, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:39,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=7920, skipped=152, lr=[4.405385414932725e-06, 4.405385414932725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:39,779] [INFO] [timer.py:215:stop] epoch=8/micro_step=560/global_step=7920, RunningAvgSamplesPerSec=85.10708522534148, CurrSamplesPerSec=84.96381949915316, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:47,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=7930, skipped=152, lr=[4.395127714359361e-06, 4.395127714359361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:47,306] [INFO] [timer.py:215:stop] 
epoch=8/micro_step=570/global_step=7930, RunningAvgSamplesPerSec=85.10705747587618, CurrSamplesPerSec=84.95077872865356, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:42:54,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=7940, skipped=152, lr=[4.3848719718343285e-06, 4.3848719718343285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:42:54,833] [INFO] [timer.py:215:stop] epoch=8/micro_step=580/global_step=7940, RunningAvgSamplesPerSec=85.10704975999671, CurrSamplesPerSec=84.81289946291206, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:02,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=7950, skipped=152, lr=[4.374618234072057e-06, 4.374618234072057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:02,360] [INFO] [timer.py:215:stop] epoch=8/micro_step=590/global_step=7950, RunningAvgSamplesPerSec=85.10703140679094, CurrSamplesPerSec=84.8640587166544, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:06,074] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:43:06,769] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:43:09,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=7960, skipped=154, lr=[4.366416718677702e-06, 4.366416718677702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:09,779] [INFO] [timer.py:215:stop] epoch=8/micro_step=600/global_step=7960, RunningAvgSamplesPerSec=85.10855739966644, CurrSamplesPerSec=85.28223005814247, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:17,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=7970, skipped=154, lr=[4.356166707179485e-06, 4.356166707179485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:17,324] [INFO] [timer.py:215:stop] epoch=8/micro_step=610/global_step=7970, RunningAvgSamplesPerSec=85.10827626034464, CurrSamplesPerSec=84.98251376984432, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:24,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=7980, skipped=154, lr=[4.345918831195178e-06, 4.345918831195178e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:24,872] [INFO] [timer.py:215:stop] epoch=8/micro_step=620/global_step=7980, RunningAvgSamplesPerSec=85.10797173922037, CurrSamplesPerSec=83.7389872352346, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:32,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=7990, skipped=154, lr=[4.335673137403381e-06, 4.335673137403381e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:32,405] [INFO] [timer.py:215:stop] epoch=8/micro_step=630/global_step=7990, RunningAvgSamplesPerSec=85.10787062594538, CurrSamplesPerSec=85.00129701686122, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:39,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=8000, skipped=154, lr=[4.325429672472757e-06, 4.325429672472757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:39,943] [INFO] [timer.py:215:stop] epoch=8/micro_step=640/global_step=8000, RunningAvgSamplesPerSec=85.10770181107381, CurrSamplesPerSec=85.15475276185038, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 18:43:47,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=8010, skipped=154, lr=[4.315188483061812e-06, 4.315188483061812e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:47,486] [INFO] [timer.py:215:stop] epoch=8/micro_step=650/global_step=8010, RunningAvgSamplesPerSec=85.10746769543883, CurrSamplesPerSec=85.18896536053435, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:43:54,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=8020, skipped=154, lr=[4.304949615818686e-06, 4.304949615818686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:43:55,029] [INFO] [timer.py:215:stop] epoch=8/micro_step=660/global_step=8020, RunningAvgSamplesPerSec=85.10722243354053, CurrSamplesPerSec=84.85238961953598, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:02,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=8030, skipped=154, lr=[4.2947131173809494e-06, 4.2947131173809494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:02,565] [INFO] [timer.py:215:stop] epoch=8/micro_step=670/global_step=8030, RunningAvgSamplesPerSec=85.10708242572191, CurrSamplesPerSec=84.91716007965468, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:10,074] [INFO] [logging.py:96:log_dist] [Rank 0] step=8040, skipped=154, lr=[4.284479034375376e-06, 4.284479034375376e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:10,107] [INFO] [timer.py:215:stop] epoch=8/micro_step=680/global_step=8040, RunningAvgSamplesPerSec=85.10686444965572, CurrSamplesPerSec=84.80890691840591, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:17,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=8050, skipped=154, lr=[4.274247413417738e-06, 4.274247413417738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:17,640] [INFO] [timer.py:215:stop] epoch=8/micro_step=690/global_step=8050, RunningAvgSamplesPerSec=85.10676408732326, CurrSamplesPerSec=85.12102649248123, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:44:22,865] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:44:23,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:44:25,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=8060, skipped=156, lr=[4.266063920644788e-06, 4.266063920644788e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:25,073] [INFO] [timer.py:215:stop] epoch=8/micro_step=700/global_step=8060, RunningAvgSamplesPerSec=85.1080782382194, CurrSamplesPerSec=84.77200038653957, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:32,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=8070, skipped=156, lr=[4.255836848809254e-06, 4.255836848809254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:32,612] [INFO] [timer.py:215:stop] epoch=8/micro_step=710/global_step=8070, RunningAvgSamplesPerSec=85.10790274684294, CurrSamplesPerSec=84.92078671057668, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:40,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=8080, skipped=156, lr=[4.245612369485483e-06, 4.245612369485483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:40,153] [INFO] [timer.py:215:stop] epoch=8/micro_step=720/global_step=8080, RunningAvgSamplesPerSec=85.10769959485701, CurrSamplesPerSec=84.45692966617176, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:47,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=8090, skipped=156, lr=[4.2353905292455066e-06, 4.2353905292455066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:47,686] [INFO] [timer.py:215:stop] epoch=8/micro_step=730/global_step=8090, RunningAvgSamplesPerSec=85.10759905464748, CurrSamplesPerSec=85.02068098778194, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:44:55,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=8100, skipped=156, 
lr=[4.225171374649331e-06, 4.225171374649331e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:44:55,218] [INFO] [timer.py:215:stop] epoch=8/micro_step=740/global_step=8100, RunningAvgSamplesPerSec=85.10751829575698, CurrSamplesPerSec=84.93419450209159, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:02,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=8110, skipped=156, lr=[4.21495495224473e-06, 4.21495495224473e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:02,750] [INFO] [timer.py:215:stop] epoch=8/micro_step=750/global_step=8110, RunningAvgSamplesPerSec=85.10744998654314, CurrSamplesPerSec=84.6673479532385, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:10,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=8120, skipped=156, lr=[4.204741308567039e-06, 4.204741308567039e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:10,285] [INFO] [timer.py:215:stop] epoch=8/micro_step=760/global_step=8120, RunningAvgSamplesPerSec=85.10732918550688, CurrSamplesPerSec=84.98450472355191, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:17,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=8130, skipped=156, lr=[4.1945304901389275e-06, 4.1945304901389275e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:17,824] [INFO] [timer.py:215:stop] epoch=8/micro_step=770/global_step=8130, RunningAvgSamplesPerSec=85.10713843892165, CurrSamplesPerSec=84.8708470471439, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:25,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=8140, skipped=156, lr=[4.1843225434702e-06, 4.1843225434702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:25,354] [INFO] [timer.py:215:stop] epoch=8/micro_step=780/global_step=8140, RunningAvgSamplesPerSec=85.10709503685177, CurrSamplesPerSec=85.36603357647567, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:32,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=8150, skipped=156, lr=[4.174117515057583e-06, 4.174117515057583e-06], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:32,897] [INFO] [timer.py:215:stop] epoch=8/micro_step=790/global_step=8150, RunningAvgSamplesPerSec=85.10687653065254, CurrSamplesPerSec=85.15089003282507, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:39,620] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:45:40,318] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:45:40,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=8160, skipped=158, lr=[4.165955624709205e-06, 4.165955624709205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:40,319] [INFO] [timer.py:215:stop] epoch=8/micro_step=800/global_step=8160, RunningAvgSamplesPerSec=85.1083153837857, CurrSamplesPerSec=91.78984543113474, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:47,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=8170, skipped=158, lr=[4.155755966286761e-06, 4.155755966286761e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:47,850] [INFO] [timer.py:215:stop] epoch=8/micro_step=810/global_step=8170, RunningAvgSamplesPerSec=85.10825299828556, CurrSamplesPerSec=84.89323208211418, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:45:55,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=8180, skipped=158, lr=[4.145559356239861e-06, 4.145559356239861e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:45:55,379] [INFO] [timer.py:215:stop] epoch=8/micro_step=820/global_step=8180, RunningAvgSamplesPerSec=85.1082152943649, CurrSamplesPerSec=85.18512655968311, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:02,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=8190, skipped=158, lr=[4.135365841013592e-06, 4.135365841013592e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:02,907] [INFO] [timer.py:215:stop] 
epoch=8/micro_step=830/global_step=8190, RunningAvgSamplesPerSec=85.10818750654012, CurrSamplesPerSec=85.24539778524617, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:10,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=8200, skipped=158, lr=[4.12517546703894e-06, 4.12517546703894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:10,446] [INFO] [timer.py:215:stop] epoch=8/micro_step=840/global_step=8200, RunningAvgSamplesPerSec=85.10800939131369, CurrSamplesPerSec=85.02787145762677, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:17,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=8210, skipped=158, lr=[4.114988280732588e-06, 4.114988280732588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:17,984] [INFO] [timer.py:215:stop] epoch=8/micro_step=850/global_step=8210, RunningAvgSamplesPerSec=85.10784986753583, CurrSamplesPerSec=84.60247792006248, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:25,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=8220, skipped=158, lr=[4.104804328496698e-06, 4.104804328496698e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:25,516] [INFO] [timer.py:215:stop] epoch=8/micro_step=860/global_step=8220, RunningAvgSamplesPerSec=85.10776654771989, CurrSamplesPerSec=85.30564592218826, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:33,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=8230, skipped=158, lr=[4.0946236567186964e-06, 4.0946236567186964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:33,054] [INFO] [timer.py:215:stop] epoch=8/micro_step=870/global_step=8230, RunningAvgSamplesPerSec=85.10761356814778, CurrSamplesPerSec=85.07155207820985, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:40,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=8240, skipped=158, lr=[4.084446311771076e-06, 4.084446311771076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:40,596] [INFO] [timer.py:215:stop] epoch=8/micro_step=880/global_step=8240, 
RunningAvgSamplesPerSec=85.10739538894362, CurrSamplesPerSec=84.92022254764348, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=8250, skipped=158, lr=[4.074272340011168e-06, 4.074272340011168e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:48,117] [INFO] [timer.py:215:stop] epoch=8/micro_step=890/global_step=8250, RunningAvgSamplesPerSec=85.10746208780212, CurrSamplesPerSec=84.89790383008493, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:55,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=8260, skipped=158, lr=[4.064101787780942e-06, 4.064101787780942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:46:55,641] [INFO] [timer.py:215:stop] epoch=8/micro_step=900/global_step=8260, RunningAvgSamplesPerSec=85.10748940166903, CurrSamplesPerSec=84.89159440712919, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:46:56,337] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:46:57,032] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:47:03,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=8270, skipped=160, lr=[4.05596783919e-06, 4.05596783919e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:03,048] [INFO] [timer.py:215:stop] epoch=8/micro_step=910/global_step=8270, RunningAvgSamplesPerSec=85.10911594268863, CurrSamplesPerSec=85.23007843370094, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:47:10,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=8280, skipped=160, lr=[4.045803558845116e-06, 4.045803558845116e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:10,576] [INFO] [timer.py:215:stop] epoch=8/micro_step=920/global_step=8280, RunningAvgSamplesPerSec=85.10910137886457, CurrSamplesPerSec=84.92218371772103, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 9/16 *****
ppl: 1.7948983907699585
Beginning of Epoch 10/16, Total Micro Batches 920
[2023-06-29 18:47:35,987] [INFO] [logging.py:96:log_dist] [Rank 0] step=8290, skipped=160, lr=[4.0356428277038916e-06, 4.0356428277038916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:36,020] [INFO] [timer.py:215:stop] epoch=9/micro_step=10/global_step=8290, RunningAvgSamplesPerSec=85.10860291881383, CurrSamplesPerSec=84.60810441101752, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:47:43,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=8300, skipped=160, lr=[4.0254856920479895e-06, 4.0254856920479895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:43,556] [INFO] [timer.py:215:stop] epoch=9/micro_step=20/global_step=8300, RunningAvgSamplesPerSec=85.10847476063968, CurrSamplesPerSec=85.19788785389032, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:47:51,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=8310, skipped=160, lr=[4.01533219814269e-06, 4.01533219814269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:51,098] [INFO] [timer.py:215:stop] epoch=9/micro_step=30/global_step=8310, 
RunningAvgSamplesPerSec=85.10825221779403, CurrSamplesPerSec=84.95204229825052, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:47:58,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=8320, skipped=160, lr=[4.005182392236684e-06, 4.005182392236684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:47:58,621] [INFO] [timer.py:215:stop] epoch=9/micro_step=40/global_step=8320, RunningAvgSamplesPerSec=85.10829099801798, CurrSamplesPerSec=85.1556982384235, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:06,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=8330, skipped=160, lr=[3.995036320561872e-06, 3.995036320561872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:06,158] [INFO] [timer.py:215:stop] epoch=9/micro_step=50/global_step=8330, RunningAvgSamplesPerSec=85.10814534435953, CurrSamplesPerSec=85.12988076413956, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:13,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=8340, skipped=160, lr=[3.9848940293331355e-06, 3.9848940293331355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:13,687] [INFO] [timer.py:215:stop] epoch=9/micro_step=60/global_step=8340, RunningAvgSamplesPerSec=85.10809963557891, CurrSamplesPerSec=84.95198852856576, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:21,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=8350, skipped=160, lr=[3.974755564748145e-06, 3.974755564748145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:21,221] [INFO] [timer.py:215:stop] epoch=9/micro_step=70/global_step=8350, RunningAvgSamplesPerSec=85.10799224994165, CurrSamplesPerSec=85.00536153382112, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:28,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=8360, skipped=160, lr=[3.964620972987135e-06, 3.964620972987135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:28,743] [INFO] [timer.py:215:stop] epoch=9/micro_step=80/global_step=8360, RunningAvgSamplesPerSec=85.10805404534257, 
CurrSamplesPerSec=85.18599161136709, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:30,945] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:48:31,642] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:48:36,135] [INFO] [logging.py:96:log_dist] [Rank 0] step=8370, skipped=162, lr=[3.956516119033455e-06, 3.956516119033455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:36,169] [INFO] [timer.py:215:stop] epoch=9/micro_step=90/global_step=8370, RunningAvgSamplesPerSec=85.10939870719932, CurrSamplesPerSec=84.52314680556356, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:43,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=8380, skipped=162, lr=[3.946388614673359e-06, 3.946388614673359e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:43,721] [INFO] [timer.py:215:stop] epoch=9/micro_step=100/global_step=8380, RunningAvgSamplesPerSec=85.10903772021705, CurrSamplesPerSec=84.7832189300596, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:51,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=8390, skipped=162, lr=[3.936265112347387e-06, 3.936265112347387e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:51,258] [INFO] [timer.py:215:stop] epoch=9/micro_step=110/global_step=8390, RunningAvgSamplesPerSec=85.10889878799864, CurrSamplesPerSec=85.10772155442481, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:48:58,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=8400, skipped=162, lr=[3.926145658167621e-06, 3.926145658167621e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:48:58,789] [INFO] [timer.py:215:stop] epoch=9/micro_step=120/global_step=8400, RunningAvgSamplesPerSec=85.10881966600682, CurrSamplesPerSec=85.08341637696483, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:49:06,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=8410, skipped=162, lr=[3.916030298227706e-06, 3.916030298227706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:06,314] [INFO] [timer.py:215:stop] epoch=9/micro_step=130/global_step=8410, RunningAvgSamplesPerSec=85.10883520581861, CurrSamplesPerSec=85.26110177477844, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:13,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=8420, skipped=162, lr=[3.905919078602639e-06, 3.905919078602639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:13,845] [INFO] [timer.py:215:stop] epoch=9/micro_step=140/global_step=8420, RunningAvgSamplesPerSec=85.10877122752592, CurrSamplesPerSec=85.01316864999248, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:21,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=8430, skipped=162, lr=[3.89581204534855e-06, 3.89581204534855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:21,375] [INFO] [timer.py:215:stop] epoch=9/micro_step=150/global_step=8430, RunningAvgSamplesPerSec=85.10870641822373, CurrSamplesPerSec=85.09150756910077, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:28,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=8440, skipped=162, lr=[3.885709244502516e-06, 3.885709244502516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:28,907] [INFO] [timer.py:215:stop] epoch=9/micro_step=160/global_step=8440, RunningAvgSamplesPerSec=85.10862851684035, CurrSamplesPerSec=84.99185053541909, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:36,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=8450, skipped=162, lr=[3.875610722082321e-06, 3.875610722082321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:36,433] [INFO] [timer.py:215:stop] epoch=9/micro_step=170/global_step=8450, RunningAvgSamplesPerSec=85.10862617517459, CurrSamplesPerSec=85.27651355348965, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:43,942] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=8460, skipped=162, lr=[3.865516524086265e-06, 3.865516524086265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:43,976] [INFO] [timer.py:215:stop] epoch=9/micro_step=180/global_step=8460, RunningAvgSamplesPerSec=85.10839884275158, CurrSamplesPerSec=84.99874006961757, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:47,696] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:49:48,398] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:49:51,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=8470, skipped=164, lr=[3.8574443101730934e-06, 3.8574443101730934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:51,413] [INFO] [timer.py:215:stop] epoch=9/micro_step=190/global_step=8470, RunningAvgSamplesPerSec=85.10958481955683, CurrSamplesPerSec=85.09579652491085, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:49:58,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=8480, skipped=164, lr=[3.847358011993206e-06, 3.847358011993206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:49:58,951] [INFO] [timer.py:215:stop] epoch=9/micro_step=200/global_step=8480, RunningAvgSamplesPerSec=85.10942590217253, CurrSamplesPerSec=84.82230620473476, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:06,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=8490, skipped=164, lr=[3.837276166927244e-06, 3.837276166927244e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:06,486] [INFO] [timer.py:215:stop] epoch=9/micro_step=210/global_step=8490, RunningAvgSamplesPerSec=85.10929826735648, CurrSamplesPerSec=85.02717120861055, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:13,982] [INFO] [logging.py:96:log_dist] [Rank 0] step=8500, skipped=164, lr=[3.827198820897545e-06, 3.827198820897545e-06], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:14,015] [INFO] [timer.py:215:stop] epoch=9/micro_step=220/global_step=8500, RunningAvgSamplesPerSec=85.10927098657018, CurrSamplesPerSec=85.12283498974317, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:21,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=8510, skipped=164, lr=[3.817126019805953e-06, 3.817126019805953e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:21,549] [INFO] [timer.py:215:stop] epoch=9/micro_step=230/global_step=8510, RunningAvgSamplesPerSec=85.10916891256747, CurrSamplesPerSec=84.75925929434291, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:29,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=8520, skipped=164, lr=[3.807057809533608e-06, 3.807057809533608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:29,086] [INFO] [timer.py:215:stop] epoch=9/micro_step=240/global_step=8520, RunningAvgSamplesPerSec=85.109023997959, CurrSamplesPerSec=85.034443587384, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:36,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=8530, skipped=164, lr=[3.796994235940744e-06, 3.796994235940744e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:36,626] [INFO] [timer.py:215:stop] epoch=9/micro_step=250/global_step=8530, RunningAvgSamplesPerSec=85.10884568394431, CurrSamplesPerSec=84.93118476958499, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:44,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=8540, skipped=164, lr=[3.786935344866471e-06, 3.786935344866471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:44,162] [INFO] [timer.py:215:stop] epoch=9/micro_step=260/global_step=8540, RunningAvgSamplesPerSec=85.10872473374054, CurrSamplesPerSec=84.89213134278513, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:51,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=8550, skipped=164, lr=[3.7768811821285694e-06, 3.7768811821285694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
18:50:51,698] [INFO] [timer.py:215:stop] epoch=9/micro_step=270/global_step=8550, RunningAvgSamplesPerSec=85.1085877432304, CurrSamplesPerSec=84.94158537526944, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:50:59,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=8560, skipped=164, lr=[3.7668317935232878e-06, 3.7668317935232878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:50:59,225] [INFO] [timer.py:215:stop] epoch=9/micro_step=280/global_step=8560, RunningAvgSamplesPerSec=85.108573617435, CurrSamplesPerSec=84.56583071804044, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:04,440] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:51:05,134] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:51:06,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=8570, skipped=166, lr=[3.7587957507757475e-06, 3.7587957507757475e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:06,639] [INFO] [timer.py:215:stop] epoch=9/micro_step=290/global_step=8570, RunningAvgSamplesPerSec=85.11004892480689, CurrSamplesPerSec=85.1104469737467, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:14,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=8580, skipped=166, lr=[3.7487550709461683e-06, 3.7487550709461683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:14,176] [INFO] [timer.py:215:stop] epoch=9/micro_step=300/global_step=8580, RunningAvgSamplesPerSec=85.10990250067859, CurrSamplesPerSec=84.89339316831276, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:21,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=8590, skipped=166, lr=[3.7387192933623415e-06, 3.7387192933623415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:21,705] [INFO] [timer.py:215:stop] epoch=9/micro_step=310/global_step=8590, 
RunningAvgSamplesPerSec=85.10986692129028, CurrSamplesPerSec=85.0771602434077, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:29,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=8600, skipped=166, lr=[3.7286884637367676e-06, 3.7286884637367676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:29,235] [INFO] [timer.py:215:stop] epoch=9/micro_step=320/global_step=8600, RunningAvgSamplesPerSec=85.10981349130805, CurrSamplesPerSec=85.2838828397434, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:36,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=8610, skipped=166, lr=[3.718662627759408e-06, 3.718662627759408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:36,771] [INFO] [timer.py:215:stop] epoch=9/micro_step=330/global_step=8610, RunningAvgSamplesPerSec=85.10966956655561, CurrSamplesPerSec=85.14170732883196, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:44,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=8620, skipped=166, lr=[3.708641831097484e-06, 3.708641831097484e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:44,301] [INFO] [timer.py:215:stop] epoch=9/micro_step=340/global_step=8620, RunningAvgSamplesPerSec=85.10962671632146, CurrSamplesPerSec=85.23365065232915, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:51,803] [INFO] [logging.py:96:log_dist] [Rank 0] step=8630, skipped=166, lr=[3.6986261193952582e-06, 3.6986261193952582e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:51,836] [INFO] [timer.py:215:stop] epoch=9/micro_step=350/global_step=8630, RunningAvgSamplesPerSec=85.10949865171295, CurrSamplesPerSec=84.98897125637254, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:51:59,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=8640, skipped=166, lr=[3.688615538273831e-06, 3.688615538273831e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:51:59,380] [INFO] [timer.py:215:stop] epoch=9/micro_step=360/global_step=8640, RunningAvgSamplesPerSec=85.10927088703173, 
CurrSamplesPerSec=84.91866442229244, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:06,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=8650, skipped=166, lr=[3.678610133330939e-06, 3.678610133330939e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:06,914] [INFO] [timer.py:215:stop] epoch=9/micro_step=370/global_step=8650, RunningAvgSamplesPerSec=85.10915856162435, CurrSamplesPerSec=84.96134547828913, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:14,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=8660, skipped=166, lr=[3.6686099501407364e-06, 3.6686099501407364e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:14,446] [INFO] [timer.py:215:stop] epoch=9/micro_step=380/global_step=8660, RunningAvgSamplesPerSec=85.1090750542496, CurrSamplesPerSec=84.90421419098458, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:21,169] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:52:21,865] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:52:21,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=8670, skipped=168, lr=[3.6606135938611617e-06, 3.6606135938611617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:21,866] [INFO] [timer.py:215:stop] epoch=9/micro_step=390/global_step=8670, RunningAvgSamplesPerSec=85.1104620976729, CurrSamplesPerSec=92.0582963439782, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:29,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=8680, skipped=168, lr=[3.650622924596618e-06, 3.650622924596618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:29,400] [INFO] [timer.py:215:stop] epoch=9/micro_step=400/global_step=8680, RunningAvgSamplesPerSec=85.11036465928429, CurrSamplesPerSec=84.90381137396328, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:36,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=8690, skipped=168, lr=[3.6406376045652013e-06, 3.6406376045652013e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:36,918] [INFO] [timer.py:215:stop] epoch=9/micro_step=410/global_step=8690, RunningAvgSamplesPerSec=85.11045618313526, CurrSamplesPerSec=85.19120932688793, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:44,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=8700, skipped=168, lr=[3.630657679249581e-06, 3.630657679249581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:44,450] [INFO] [timer.py:215:stop] epoch=9/micro_step=420/global_step=8700, RunningAvgSamplesPerSec=85.1103810123548, CurrSamplesPerSec=85.10556293032633, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:52:51,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=8710, skipped=168, lr=[3.6206831941078554e-06, 3.6206831941078554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:51,981] [INFO] [timer.py:215:stop] epoch=9/micro_step=430/global_step=8710, RunningAvgSamplesPerSec=85.11031216874805, CurrSamplesPerSec=84.98590382791298, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 18:52:59,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=8720, skipped=168, lr=[3.61071419457334e-06, 3.61071419457334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:52:59,522] [INFO] [timer.py:215:stop] epoch=9/micro_step=440/global_step=8720, RunningAvgSamplesPerSec=85.11011521601029, CurrSamplesPerSec=85.17142324098552, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:07,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=8730, skipped=168, lr=[3.600750726054367e-06, 3.600750726054367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:07,056] [INFO] [timer.py:215:stop] epoch=9/micro_step=450/global_step=8730, RunningAvgSamplesPerSec=85.11001063670096, CurrSamplesPerSec=85.10585973464683, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:14,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=8740, skipped=168, lr=[3.590792833934074e-06, 3.590792833934074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:14,596] [INFO] [timer.py:215:stop] epoch=9/micro_step=460/global_step=8740, RunningAvgSamplesPerSec=85.1098205310403, CurrSamplesPerSec=84.74705713356633, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:22,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=8750, skipped=168, lr=[3.580840563570196e-06, 3.580840563570196e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:22,129] [INFO] [timer.py:215:stop] epoch=9/micro_step=470/global_step=8750, RunningAvgSamplesPerSec=85.10973167725346, CurrSamplesPerSec=85.14935044131292, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:29,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=8760, skipped=168, lr=[3.570893960294865e-06, 3.570893960294865e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:29,663] [INFO] [timer.py:215:stop] epoch=9/micro_step=480/global_step=8760, RunningAvgSamplesPerSec=85.10962693372151, CurrSamplesPerSec=84.87970302351907, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:53:37,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=8770, skipped=168, lr=[3.5609530694143975e-06, 3.5609530694143975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:37,199] [INFO] [timer.py:215:stop] epoch=9/micro_step=490/global_step=8770, RunningAvgSamplesPerSec=85.10949615260337, CurrSamplesPerSec=85.0783197159444, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:37,894] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:53:38,589] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:53:44,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=8780, skipped=170, lr=[3.553004500063564e-06, 3.553004500063564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:44,629] [INFO] [timer.py:215:stop] epoch=9/micro_step=500/global_step=8780, RunningAvgSamplesPerSec=85.11072577268273, CurrSamplesPerSec=84.72279704847594, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:52,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=8790, skipped=170, lr=[3.543074005582579e-06, 3.543074005582579e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:52,161] [INFO] [timer.py:215:stop] epoch=9/micro_step=510/global_step=8790, RunningAvgSamplesPerSec=85.11063938354668, CurrSamplesPerSec=84.83608508529578, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:53:59,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=8800, skipped=170, lr=[3.533149350215063e-06, 3.533149350215063e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:53:59,692] [INFO] [timer.py:215:stop] epoch=9/micro_step=520/global_step=8800, RunningAvgSamplesPerSec=85.11057910273588, CurrSamplesPerSec=85.12329387440582, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:07,197] [INFO] [logging.py:96:log_dist] [Rank 0] step=8810, skipped=170, 
lr=[3.5232305791673577e-06, 3.5232305791673577e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:07,231] [INFO] [timer.py:215:stop] epoch=9/micro_step=530/global_step=8810, RunningAvgSamplesPerSec=85.11041646220625, CurrSamplesPerSec=85.04689032142781, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:14,749] [INFO] [logging.py:96:log_dist] [Rank 0] step=8820, skipped=170, lr=[3.5133177376190076e-06, 3.5133177376190076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:14,783] [INFO] [timer.py:215:stop] epoch=9/micro_step=540/global_step=8820, RunningAvgSamplesPerSec=85.11007495399537, CurrSamplesPerSec=85.28450603632245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:22,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=8830, skipped=170, lr=[3.5034108707225454e-06, 3.5034108707225454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:22,325] [INFO] [timer.py:215:stop] epoch=9/micro_step=550/global_step=8830, RunningAvgSamplesPerSec=85.1098680549783, CurrSamplesPerSec=85.15348315490685, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:29,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=8840, skipped=170, lr=[3.4935100236032875e-06, 3.4935100236032875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:29,863] [INFO] [timer.py:215:stop] epoch=9/micro_step=560/global_step=8840, RunningAvgSamplesPerSec=85.10970384872672, CurrSamplesPerSec=84.97124242632393, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:37,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=8850, skipped=170, lr=[3.483615241359139e-06, 3.483615241359139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:37,405] [INFO] [timer.py:215:stop] epoch=9/micro_step=570/global_step=8850, RunningAvgSamplesPerSec=85.10950283722084, CurrSamplesPerSec=84.98170665302412, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:44,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=8860, skipped=170, lr=[3.4737265690603706e-06, 
3.4737265690603706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:44,936] [INFO] [timer.py:215:stop] epoch=9/micro_step=580/global_step=8860, RunningAvgSamplesPerSec=85.10942683339186, CurrSamplesPerSec=84.7467360715163, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:52,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=8870, skipped=170, lr=[3.463844051749425e-06, 3.463844051749425e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:52,471] [INFO] [timer.py:215:stop] epoch=9/micro_step=590/global_step=8870, RunningAvgSamplesPerSec=85.1093137102736, CurrSamplesPerSec=84.94650437824936, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:54:54,677] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:54:55,373] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:54:59,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=8880, skipped=172, lr=[3.455942499742533e-06, 3.455942499742533e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:54:59,894] [INFO] [timer.py:215:stop] epoch=9/micro_step=600/global_step=8880, RunningAvgSamplesPerSec=85.11061880042146, CurrSamplesPerSec=85.1520245043024, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:07,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=8890, skipped=172, lr=[3.4460711748270122e-06, 3.4460711748270122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:07,423] [INFO] [timer.py:215:stop] epoch=9/micro_step=610/global_step=8890, RunningAvgSamplesPerSec=85.11057101239805, CurrSamplesPerSec=84.9093437286094, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:14,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=8900, skipped=172, lr=[3.4362061308683534e-06, 3.4362061308683534e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:14,955] [INFO] 
[timer.py:215:stop] epoch=9/micro_step=620/global_step=8900, RunningAvgSamplesPerSec=85.11049559446995, CurrSamplesPerSec=84.868110119217, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:22,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=8910, skipped=172, lr=[3.4263474128013763e-06, 3.4263474128013763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:22,495] [INFO] [timer.py:215:stop] epoch=9/micro_step=630/global_step=8910, RunningAvgSamplesPerSec=85.11030228744309, CurrSamplesPerSec=85.03519783119195, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:30,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=8920, skipped=172, lr=[3.416495065532083e-06, 3.416495065532083e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:30,037] [INFO] [timer.py:215:stop] epoch=9/micro_step=640/global_step=8920, RunningAvgSamplesPerSec=85.1100889506377, CurrSamplesPerSec=84.98980541994901, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:37,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=8930, skipped=172, lr=[3.406649133937459e-06, 3.406649133937459e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:37,585] [INFO] [timer.py:215:stop] epoch=9/micro_step=650/global_step=8930, RunningAvgSamplesPerSec=85.10981667660174, CurrSamplesPerSec=84.98245996157932, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:45,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=8940, skipped=172, lr=[3.396809662865268e-06, 3.396809662865268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:45,122] [INFO] [timer.py:215:stop] epoch=9/micro_step=660/global_step=8940, RunningAvgSamplesPerSec=85.109669502963, CurrSamplesPerSec=85.00129701686122, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:55:52,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=8950, skipped=172, lr=[3.386976697133843e-06, 3.386976697133843e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:55:52,657] [INFO] [timer.py:215:stop] 
epoch=9/micro_step=670/global_step=8950, RunningAvgSamplesPerSec=85.10954806318476, CurrSamplesPerSec=85.09231677295462, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:00,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=8960, skipped=172, lr=[3.377150281531885e-06, 3.377150281531885e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:00,189] [INFO] [timer.py:215:stop] epoch=9/micro_step=680/global_step=8960, RunningAvgSamplesPerSec=85.1094563164459, CurrSamplesPerSec=85.00525385948562, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:07,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=8970, skipped=172, lr=[3.367330460818266e-06, 3.367330460818266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:07,733] [INFO] [timer.py:215:stop] epoch=9/micro_step=690/global_step=8970, RunningAvgSamplesPerSec=85.10923359825685, CurrSamplesPerSec=84.92573017306236, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:11,445] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:56:12,143] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:56:15,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=8980, skipped=174, lr=[3.359479382625759e-06, 3.359479382625759e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:15,163] [INFO] [timer.py:215:stop] epoch=9/micro_step=700/global_step=8980, RunningAvgSamplesPerSec=85.11044069483074, CurrSamplesPerSec=84.77751557628463, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:22,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=8990, skipped=174, lr=[3.349671545407474e-06, 3.349671545407474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:22,709] [INFO] [timer.py:215:stop] epoch=9/micro_step=710/global_step=8990, RunningAvgSamplesPerSec=85.110185634102, CurrSamplesPerSec=84.85040485113313, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:30,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=9000, skipped=174, lr=[3.3398704282418955e-06, 3.3398704282418955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:30,252] [INFO] [timer.py:215:stop] epoch=9/micro_step=720/global_step=9000, RunningAvgSamplesPerSec=85.10996564518612, CurrSamplesPerSec=84.70215883091119, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:37,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=9010, skipped=174, lr=[3.3300760757726578e-06, 3.3300760757726578e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:37,791] [INFO] [timer.py:215:stop] epoch=9/micro_step=730/global_step=9010, RunningAvgSamplesPerSec=85.10980298224281, CurrSamplesPerSec=85.1261012765326, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:56:45,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=9020, skipped=174, lr=[3.32028853261258e-06, 3.32028853261258e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:45,316] [INFO] [timer.py:215:stop] epoch=9/micro_step=740/global_step=9020, RunningAvgSamplesPerSec=85.10980165259593, CurrSamplesPerSec=85.17496351826404, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 18:56:52,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=9030, skipped=174, lr=[3.3105078433434694e-06, 3.3105078433434694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:56:52,847] [INFO] [timer.py:215:stop] epoch=9/micro_step=750/global_step=9030, RunningAvgSamplesPerSec=85.10974396511452, CurrSamplesPerSec=84.88815817976267, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:00,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=9040, skipped=174, lr=[3.300734052515911e-06, 3.300734052515911e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:00,371] [INFO] [timer.py:215:stop] epoch=9/micro_step=760/global_step=9040, RunningAvgSamplesPerSec=85.10976613848565, CurrSamplesPerSec=85.12375276401606, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:07,862] [INFO] [logging.py:96:log_dist] [Rank 0] step=9050, skipped=174, lr=[3.2909672046490673e-06, 3.2909672046490673e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:07,896] [INFO] [timer.py:215:stop] epoch=9/micro_step=770/global_step=9050, RunningAvgSamplesPerSec=85.109777034104, CurrSamplesPerSec=85.25647120401554, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:15,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=9060, skipped=174, lr=[3.2812073442304823e-06, 3.2812073442304823e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:15,430] [INFO] [timer.py:215:stop] epoch=9/micro_step=780/global_step=9060, RunningAvgSamplesPerSec=85.1096710442042, CurrSamplesPerSec=85.46921015420952, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:22,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=9070, skipped=174, lr=[3.271454515715864e-06, 3.271454515715864e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:22,954] [INFO] [timer.py:215:stop] epoch=9/micro_step=790/global_step=9070, RunningAvgSamplesPerSec=85.10968797822574, CurrSamplesPerSec=85.2332717664612, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
18:57:28,174] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 18:57:28,870] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 18:57:30,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=9080, skipped=176, lr=[3.2636573457288193e-06, 3.2636573457288193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:30,380] [INFO] [timer.py:215:stop] epoch=9/micro_step=800/global_step=9080, RunningAvgSamplesPerSec=85.1109299844678, CurrSamplesPerSec=84.70555328708171, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:37,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=9090, skipped=176, lr=[3.253917286567367e-06, 3.253917286567367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:37,916] [INFO] [timer.py:215:stop] epoch=9/micro_step=810/global_step=9090, RunningAvgSamplesPerSec=85.11080358804877, CurrSamplesPerSec=84.62917706789176, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:45,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=9100, skipped=176, lr=[3.24418438361483e-06, 3.24418438361483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:45,450] [INFO] [timer.py:215:stop] epoch=9/micro_step=820/global_step=9100, RunningAvgSamplesPerSec=85.11069978662144, CurrSamplesPerSec=85.11951497163587, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:57:52,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=9110, skipped=176, lr=[3.2344586812041282e-06, 3.2344586812041282e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:57:52,976] [INFO] [timer.py:215:stop] epoch=9/micro_step=830/global_step=9110, RunningAvgSamplesPerSec=85.11069620846578, CurrSamplesPerSec=85.20732610034345, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:00,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=9120, skipped=176, 
lr=[3.2247402236353862e-06, 3.2247402236353862e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:58:00,503] [INFO] [timer.py:215:stop] epoch=9/micro_step=840/global_step=9120, RunningAvgSamplesPerSec=85.11067202179144, CurrSamplesPerSec=85.0909141627228, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:07,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=9130, skipped=176, lr=[3.215029055175729e-06, 3.215029055175729e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:58:08,024] [INFO] [timer.py:215:stop] epoch=9/micro_step=850/global_step=9130, RunningAvgSamplesPerSec=85.11073209914696, CurrSamplesPerSec=85.4044646254599, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:15,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=9140, skipped=176, lr=[3.2053252200590755e-06, 3.2053252200590755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:58:15,552] [INFO] [timer.py:215:stop] epoch=9/micro_step=860/global_step=9140, RunningAvgSamplesPerSec=85.1107021439534, CurrSamplesPerSec=85.09871002450228, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:23,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=9150, skipped=176, lr=[3.1956287624859495e-06, 3.1956287624859495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:58:23,079] [INFO] [timer.py:215:stop] epoch=9/micro_step=870/global_step=9150, RunningAvgSamplesPerSec=85.1106971009835, CurrSamplesPerSec=85.11074381213429, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:30,572] [INFO] [logging.py:96:log_dist] [Rank 0] step=9160, skipped=176, lr=[3.185939726623261e-06, 3.185939726623261e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:58:30,605] [INFO] [timer.py:215:stop] epoch=9/micro_step=880/global_step=9160, RunningAvgSamplesPerSec=85.11068066256857, CurrSamplesPerSec=85.0712015924341, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:58:38,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=9170, skipped=176, lr=[3.1762581566041202e-06, 
3.1762581566041202e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 18:58:38,142] [INFO] [timer.py:215:stop] epoch=9/micro_step=890/global_step=9170, RunningAvgSamplesPerSec=85.11055139735122, CurrSamplesPerSec=84.84742787261797, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 18:58:44,864] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-06-29 18:58:45,564] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-06-29 18:58:45,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=9180, skipped=178, lr=[3.1685183056319086e-06, 3.1685183056319086e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 18:58:45,565] [INFO] [timer.py:215:stop] epoch=9/micro_step=900/global_step=9180, RunningAvgSamplesPerSec=85.11181344776433, CurrSamplesPerSec=91.49120024757984, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 18:58:53,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=9190, skipped=178, lr=[3.158850285237914e-06, 3.158850285237914e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 18:58:53,087] [INFO] [timer.py:215:stop] epoch=9/micro_step=910/global_step=9190, RunningAvgSamplesPerSec=85.11185400805432, CurrSamplesPerSec=85.3780887821203, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 18:59:00,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=9200, skipped=178, lr=[3.149189854078616e-06, 3.149189854078616e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 18:59:00,610] [INFO] [timer.py:215:stop] epoch=9/micro_step=920/global_step=9200, RunningAvgSamplesPerSec=85.11189230322576, CurrSamplesPerSec=84.7994495713676, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 10/16 *****
ppl: 1.7907774448394775
Beginning of Epoch 11/16, Total Micro Batches 920
[2023-06-29 18:59:26,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=9210,
skipped=178, lr=[3.139537056156834e-06, 3.139537056156834e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:59:26,035] [INFO] [timer.py:215:stop] epoch=10/micro_step=10/global_step=9210, RunningAvgSamplesPerSec=85.11156864793308, CurrSamplesPerSec=84.87548949519983, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:59:33,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=9220, skipped=178, lr=[3.1298919354406117e-06, 3.1298919354406117e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:59:33,577] [INFO] [timer.py:215:stop] epoch=10/micro_step=20/global_step=9220, RunningAvgSamplesPerSec=85.11136146264462, CurrSamplesPerSec=84.99734054258758, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:59:41,076] [INFO] [logging.py:96:log_dist] [Rank 0] step=9230, skipped=178, lr=[3.120254535863029e-06, 3.120254535863029e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:59:41,110] [INFO] [timer.py:215:stop] epoch=10/micro_step=30/global_step=9230, RunningAvgSamplesPerSec=85.11127853761093, CurrSamplesPerSec=84.7709563294603, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:59:48,618] [INFO] [logging.py:96:log_dist] [Rank 0] step=9240, skipped=178, lr=[3.1106249013219936e-06, 3.1106249013219936e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:59:48,652] [INFO] [timer.py:215:stop] epoch=10/micro_step=40/global_step=9240, RunningAvgSamplesPerSec=85.11107669668638, CurrSamplesPerSec=84.85166543652217, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 18:59:56,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=9250, skipped=178, lr=[3.1010030756800415e-06, 3.1010030756800415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 18:59:56,206] [INFO] [timer.py:215:stop] epoch=10/micro_step=50/global_step=9250, RunningAvgSamplesPerSec=85.11072659026894, CurrSamplesPerSec=84.78003245471469, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:03,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=9260, skipped=178, 
lr=[3.0913891027641468e-06, 3.0913891027641468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:03,739] [INFO] [timer.py:215:stop] epoch=10/micro_step=60/global_step=9260, RunningAvgSamplesPerSec=85.11062723407447, CurrSamplesPerSec=85.11247091291361, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:11,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=9270, skipped=178, lr=[3.0817830263655086e-06, 3.0817830263655086e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:11,267] [INFO] [timer.py:215:stop] epoch=10/micro_step=70/global_step=9270, RunningAvgSamplesPerSec=85.1106024156174, CurrSamplesPerSec=85.03008001084592, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:18,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=9280, skipped=178, lr=[3.0721848902393567e-06, 3.0721848902393567e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:18,791] [INFO] [timer.py:215:stop] epoch=10/micro_step=80/global_step=9280, RunningAvgSamplesPerSec=85.11062755227026, CurrSamplesPerSec=85.04570476120172, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:19,487] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:00:20,183] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:00:26,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=9290, skipped=180, lr=[3.0645121277150607e-06, 3.0645121277150607e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:26,211] [INFO] [timer.py:215:stop] epoch=10/micro_step=90/global_step=9290, RunningAvgSamplesPerSec=85.11191123442201, CurrSamplesPerSec=84.75596757596797, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:33,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=9300, skipped=180, lr=[3.054928394227003e-06, 3.054928394227003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:33,751] [INFO] [timer.py:215:stop] epoch=10/micro_step=100/global_step=9300, RunningAvgSamplesPerSec=85.11172850627904, CurrSamplesPerSec=84.55578826675655, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:41,254] [INFO] [logging.py:96:log_dist] [Rank 0] step=9310, skipped=180, lr=[3.0453527233330375e-06, 3.0453527233330375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:41,288] [INFO] [timer.py:215:stop] epoch=10/micro_step=110/global_step=9310, RunningAvgSamplesPerSec=85.11158194789567, CurrSamplesPerSec=84.6633423978008, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:48,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=9320, skipped=180, lr=[3.035785158649902e-06, 3.035785158649902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:48,831] [INFO] [timer.py:215:stop] epoch=10/micro_step=120/global_step=9320, RunningAvgSamplesPerSec=85.11137369338147, CurrSamplesPerSec=84.74628123777632, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:00:56,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=9330, skipped=180, lr=[3.0262257437574108e-06, 3.0262257437574108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:00:56,361] [INFO] [timer.py:215:stop] epoch=10/micro_step=130/global_step=9330, RunningAvgSamplesPerSec=85.11131798771733, CurrSamplesPerSec=85.25717523498207, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:01:03,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=9340, skipped=180, lr=[3.016674522198254e-06, 3.016674522198254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:03,907] [INFO] [timer.py:215:stop] epoch=10/micro_step=140/global_step=9340, RunningAvgSamplesPerSec=85.11106834620949, CurrSamplesPerSec=84.89916582722266, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:11,404] [INFO] [logging.py:96:log_dist] [Rank 0] step=9350, skipped=180, lr=[3.0071315374778044e-06, 3.0071315374778044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:11,438] [INFO] [timer.py:215:stop] epoch=10/micro_step=150/global_step=9350, RunningAvgSamplesPerSec=85.11100219208139, CurrSamplesPerSec=85.22915836527682, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:18,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=9360, skipped=180, lr=[2.9975968330639143e-06, 2.9975968330639143e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:18,975] [INFO] [timer.py:215:stop] epoch=10/micro_step=160/global_step=9360, RunningAvgSamplesPerSec=85.11086603962896, CurrSamplesPerSec=84.71488276081021, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:26,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=9370, skipped=180, lr=[2.988070452386718e-06, 2.988070452386718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:26,505] [INFO] [timer.py:215:stop] epoch=10/micro_step=170/global_step=9370, RunningAvgSamplesPerSec=85.11081607188939, CurrSamplesPerSec=84.91439330289809, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:34,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=9380, skipped=180, lr=[2.978552438838442e-06, 2.978552438838442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:34,034] [INFO] [timer.py:215:stop] epoch=10/micro_step=180/global_step=9380, RunningAvgSamplesPerSec=85.11077314272697, CurrSamplesPerSec=85.05793917639681, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 
19:01:36,239] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:01:36,933] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:01:41,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=9390, skipped=182, lr=[2.9709440814678908e-06, 2.9709440814678908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:41,452] [INFO] [timer.py:215:stop] epoch=10/micro_step=190/global_step=9390, RunningAvgSamplesPerSec=85.11206366251706, CurrSamplesPerSec=85.09056351738975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:48,958] [INFO] [logging.py:96:log_dist] [Rank 0] step=9400, skipped=182, lr=[2.9614412379782863e-06, 2.9614412379782863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:48,991] [INFO] [timer.py:215:stop] epoch=10/micro_step=200/global_step=9400, RunningAvgSamplesPerSec=85.11189149109009, CurrSamplesPerSec=84.54628076996245, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:01:56,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=9410, skipped=182, lr=[2.9519468829124396e-06, 2.9519468829124396e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:01:56,524] [INFO] [timer.py:215:stop] epoch=10/micro_step=210/global_step=9410, RunningAvgSamplesPerSec=85.11180671981667, CurrSamplesPerSec=84.8944133951003, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:04,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=9420, skipped=182, lr=[2.9424610595166944e-06, 2.9424610595166944e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:04,060] [INFO] [timer.py:215:stop] epoch=10/micro_step=220/global_step=9420, RunningAvgSamplesPerSec=85.1116712398512, CurrSamplesPerSec=85.08711116379926, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:11,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=9430, skipped=182, 
lr=[2.932983810998537e-06, 2.932983810998537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:11,590] [INFO] [timer.py:215:stop] epoch=10/micro_step=230/global_step=9430, RunningAvgSamplesPerSec=85.11161993652132, CurrSamplesPerSec=85.07756470698232, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:19,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=9440, skipped=182, lr=[2.9235151805263955e-06, 2.9235151805263955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:19,129] [INFO] [timer.py:215:stop] epoch=10/micro_step=240/global_step=9440, RunningAvgSamplesPerSec=85.11145463094563, CurrSamplesPerSec=84.48704692566311, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:26,639] [INFO] [logging.py:96:log_dist] [Rank 0] step=9450, skipped=182, lr=[2.914055211229443e-06, 2.914055211229443e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:26,672] [INFO] [timer.py:215:stop] epoch=10/micro_step=250/global_step=9450, RunningAvgSamplesPerSec=85.11124376191633, CurrSamplesPerSec=85.050770567829, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:34,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=9460, skipped=182, lr=[2.904603946197398e-06, 2.904603946197398e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:34,198] [INFO] [timer.py:215:stop] epoch=10/micro_step=260/global_step=9460, RunningAvgSamplesPerSec=85.1112489397461, CurrSamplesPerSec=85.1311766657216, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:41,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=9470, skipped=182, lr=[2.8951614284803398e-06, 2.8951614284803398e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:41,723] [INFO] [timer.py:215:stop] epoch=10/micro_step=270/global_step=9470, RunningAvgSamplesPerSec=85.11124092730056, CurrSamplesPerSec=85.0512286758411, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:49,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=9480, skipped=182, lr=[2.885727701088495e-06, 
2.885727701088495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:49,251] [INFO] [timer.py:215:stop] epoch=10/micro_step=280/global_step=9480, RunningAvgSamplesPerSec=85.11120989511714, CurrSamplesPerSec=84.76383601058839, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:02:52,962] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:02:53,660] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:02:56,639] [INFO] [logging.py:96:log_dist] [Rank 0] step=9490, skipped=184, lr=[2.8781870770864895e-06, 2.8781870770864895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:02:56,673] [INFO] [timer.py:215:stop] epoch=10/micro_step=290/global_step=9490, RunningAvgSamplesPerSec=85.11245158890213, CurrSamplesPerSec=85.21373664624114, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:04,169] [INFO] [logging.py:96:log_dist] [Rank 0] step=9500, skipped=184, lr=[2.8687692805378802e-06, 2.8687692805378802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:04,202] [INFO] [timer.py:215:stop] epoch=10/micro_step=300/global_step=9500, RunningAvgSamplesPerSec=85.11240866330982, CurrSamplesPerSec=85.27133955842015, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:11,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=9510, skipped=184, lr=[2.859360394529495e-06, 2.859360394529495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:11,746] [INFO] [timer.py:215:stop] epoch=10/micro_step=310/global_step=9510, RunningAvgSamplesPerSec=85.11219768543972, CurrSamplesPerSec=85.01642652592881, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:19,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=9520, skipped=184, lr=[2.8499604619183716e-06, 2.8499604619183716e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:19,284] 
[INFO] [timer.py:215:stop] epoch=10/micro_step=320/global_step=9520, RunningAvgSamplesPerSec=85.11205012354105, CurrSamplesPerSec=84.91689145264942, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:26,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=9530, skipped=184, lr=[2.8405695255207722e-06, 2.8405695255207722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:26,828] [INFO] [timer.py:215:stop] epoch=10/micro_step=330/global_step=9530, RunningAvgSamplesPerSec=85.11183278956766, CurrSamplesPerSec=84.97659526916358, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:34,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=9540, skipped=184, lr=[2.831187628111973e-06, 2.831187628111973e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:34,360] [INFO] [timer.py:215:stop] epoch=10/micro_step=340/global_step=9540, RunningAvgSamplesPerSec=85.11175989340354, CurrSamplesPerSec=85.30819425707617, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:41,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=9550, skipped=184, lr=[2.8218148124260823e-06, 2.8218148124260823e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:41,889] [INFO] [timer.py:215:stop] epoch=10/micro_step=350/global_step=9550, RunningAvgSamplesPerSec=85.1117127765911, CurrSamplesPerSec=85.28873317525441, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:49,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=9560, skipped=184, lr=[2.8124511211558416e-06, 2.8124511211558416e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:49,415] [INFO] [timer.py:215:stop] epoch=10/micro_step=360/global_step=9560, RunningAvgSamplesPerSec=85.11171515088128, CurrSamplesPerSec=84.94252612412944, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:03:56,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=9570, skipped=184, lr=[2.8030965969524295e-06, 2.8030965969524295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:03:56,950] [INFO] [timer.py:215:stop] 
epoch=10/micro_step=370/global_step=9570, RunningAvgSamplesPerSec=85.1116124249099, CurrSamplesPerSec=84.94287555044194, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:04,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=9580, skipped=184, lr=[2.79375128242527e-06, 2.79375128242527e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:04,477] [INFO] [timer.py:215:stop] epoch=10/micro_step=380/global_step=9580, RunningAvgSamplesPerSec=85.11159394813184, CurrSamplesPerSec=85.00894186080389, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:09,684] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:04:10,381] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:04:11,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=9590, skipped=186, lr=[2.7862816903772034e-06, 2.7862816903772034e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:11,890] [INFO] [timer.py:215:stop] epoch=10/micro_step=390/global_step=9590, RunningAvgSamplesPerSec=85.11291994717988, CurrSamplesPerSec=84.7531577747594, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:19,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=9600, skipped=186, lr=[2.7769530605090217e-06, 2.7769530605090217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:19,421] [INFO] [timer.py:215:stop] epoch=10/micro_step=400/global_step=9600, RunningAvgSamplesPerSec=85.11285592913947, CurrSamplesPerSec=84.7654419930763, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:26,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=9610, skipped=186, lr=[2.7676337593996896e-06, 2.7676337593996896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:26,951] [INFO] [timer.py:215:stop] epoch=10/micro_step=410/global_step=9610, RunningAvgSamplesPerSec=85.11280982399829, 
CurrSamplesPerSec=84.97105414764778, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:34,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=9620, skipped=186, lr=[2.758323829498193e-06, 2.758323829498193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:34,490] [INFO] [timer.py:215:stop] epoch=10/micro_step=420/global_step=9620, RunningAvgSamplesPerSec=85.1126541145314, CurrSamplesPerSec=85.30025154530675, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:41,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=9630, skipped=186, lr=[2.749023313210828e-06, 2.749023313210828e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:42,011] [INFO] [timer.py:215:stop] epoch=10/micro_step=430/global_step=9630, RunningAvgSamplesPerSec=85.11269811527369, CurrSamplesPerSec=85.17642295486975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:49,511] [INFO] [logging.py:96:log_dist] [Rank 0] step=9640, skipped=186, lr=[2.739732252901016e-06, 2.739732252901016e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:49,545] [INFO] [timer.py:215:stop] epoch=10/micro_step=440/global_step=9640, RunningAvgSamplesPerSec=85.11260627121685, CurrSamplesPerSec=85.20702858788005, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:04:57,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=9650, skipped=186, lr=[2.7304506908891064e-06, 2.7304506908891064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:04:57,076] [INFO] [timer.py:215:stop] epoch=10/micro_step=450/global_step=9650, RunningAvgSamplesPerSec=85.11254174973084, CurrSamplesPerSec=84.80392346414285, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:04,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=9660, skipped=186, lr=[2.721178669452184e-06, 2.721178669452184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:04,614] [INFO] [timer.py:215:stop] epoch=10/micro_step=460/global_step=9660, RunningAvgSamplesPerSec=85.11238763989587, CurrSamplesPerSec=85.21479163862526, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:12,108] [INFO] [logging.py:96:log_dist] [Rank 0] step=9670, skipped=186, lr=[2.711916230823877e-06, 2.711916230823877e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:12,142] [INFO] [timer.py:215:stop] epoch=10/micro_step=470/global_step=9670, RunningAvgSamplesPerSec=85.11236130064663, CurrSamplesPerSec=85.01319557356375, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:19,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=9680, skipped=186, lr=[2.7026634171941642e-06, 2.7026634171941642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:19,678] [INFO] [timer.py:215:stop] epoch=10/micro_step=480/global_step=9680, RunningAvgSamplesPerSec=85.11223963775454, CurrSamplesPerSec=84.77853301978263, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:26,405] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:05:27,103] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:05:27,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=9690, skipped=188, lr=[2.6952681246130607e-06, 2.6952681246130607e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:27,104] [INFO] [timer.py:215:stop] epoch=10/micro_step=490/global_step=9690, RunningAvgSamplesPerSec=85.11339609631023, CurrSamplesPerSec=91.68146878630998, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:34,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=9700, skipped=188, lr=[2.686032742159498e-06, 2.686032742159498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:34,639] [INFO] [timer.py:215:stop] epoch=10/micro_step=500/global_step=9700, RunningAvgSamplesPerSec=85.11329429009747, CurrSamplesPerSec=84.91710635411766, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:42,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=9710, skipped=188, lr=[2.676807102602617e-06, 2.676807102602617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:42,164] [INFO] [timer.py:215:stop] epoch=10/micro_step=510/global_step=9710, RunningAvgSamplesPerSec=85.11330138960744, CurrSamplesPerSec=85.1168969286181, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:49,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=9720, skipped=188, lr=[2.6675912479647796e-06, 2.6675912479647796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:49,684] [INFO] [timer.py:215:stop] epoch=10/micro_step=520/global_step=9720, RunningAvgSamplesPerSec=85.11336256801039, CurrSamplesPerSec=85.37830602425963, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:05:57,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=9730, skipped=188, lr=[2.6583852202237785e-06, 2.6583852202237785e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:05:57,225] [INFO] [timer.py:215:stop] epoch=10/micro_step=530/global_step=9730, RunningAvgSamplesPerSec=85.113183256433, CurrSamplesPerSec=84.96892934614536, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:06:04,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=9740, skipped=188, lr=[2.6491890613126433e-06, 2.6491890613126433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:04,754] [INFO] [timer.py:215:stop] epoch=10/micro_step=540/global_step=9740, RunningAvgSamplesPerSec=85.11314118263985, CurrSamplesPerSec=85.3419605075332, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:12,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=9750, skipped=188, lr=[2.6400028131194465e-06, 2.6400028131194465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:12,281] [INFO] [timer.py:215:stop] epoch=10/micro_step=550/global_step=9750, RunningAvgSamplesPerSec=85.11312020501846, CurrSamplesPerSec=85.23817047922674, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:19,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=9760, skipped=188, lr=[2.6308265174871297e-06, 2.6308265174871297e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:19,811] [INFO] [timer.py:215:stop] epoch=10/micro_step=560/global_step=9760, RunningAvgSamplesPerSec=85.11305862137623, CurrSamplesPerSec=84.75064249171865, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:27,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=9770, skipped=188, lr=[2.6216602162132887e-06, 2.6216602162132887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:27,336] [INFO] [timer.py:215:stop] epoch=10/micro_step=570/global_step=9770, RunningAvgSamplesPerSec=85.1130607576235, CurrSamplesPerSec=85.2022686710043, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:34,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=9780, skipped=188, lr=[2.612503951050003e-06, 2.612503951050003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:34,874] [INFO] [timer.py:215:stop] epoch=10/micro_step=580/global_step=9780, RunningAvgSamplesPerSec=85.11292968860029, CurrSamplesPerSec=84.91987330767937, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:06:42,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=9790, skipped=188, lr=[2.603357763703635e-06, 2.603357763703635e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:42,405] [INFO] [timer.py:215:stop] epoch=10/micro_step=590/global_step=9790, RunningAvgSamplesPerSec=85.1128681936149, CurrSamplesPerSec=85.02811385420428, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:43,100] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:06:43,797] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:06:49,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=9800, skipped=190, lr=[2.596048097852099e-06, 2.596048097852099e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:49,826] [INFO] [timer.py:215:stop] epoch=10/micro_step=600/global_step=9800, RunningAvgSamplesPerSec=85.11407429856348, CurrSamplesPerSec=85.11525061291374, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:06:57,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=9810, skipped=190, lr=[2.586920155529573e-06, 2.586920155529573e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:06:57,363] [INFO] [timer.py:215:stop] epoch=10/micro_step=610/global_step=9810, RunningAvgSamplesPerSec=85.11394305345378, CurrSamplesPerSec=84.99419177527008, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:04,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=9820, skipped=190, lr=[2.57780240755697e-06, 2.57780240755697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:04,893] [INFO] [timer.py:215:stop] epoch=10/micro_step=620/global_step=9820, RunningAvgSamplesPerSec=85.11388814266942, CurrSamplesPerSec=85.06958400319951, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:12,400] [INFO] [logging.py:96:log_dist] [Rank 0] step=9830, skipped=190, 
lr=[2.568694895465204e-06, 2.568694895465204e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:12,434] [INFO] [timer.py:215:stop] epoch=10/micro_step=630/global_step=9830, RunningAvgSamplesPerSec=85.11371431444482, CurrSamplesPerSec=84.93029801408315, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:19,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=9840, skipped=190, lr=[2.559597660738574e-06, 2.559597660738574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:19,967] [INFO] [timer.py:215:stop] epoch=10/micro_step=640/global_step=9840, RunningAvgSamplesPerSec=85.11362596329049, CurrSamplesPerSec=85.17366628326657, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:27,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=9850, skipped=190, lr=[2.5505107448145615e-06, 2.5505107448145615e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:27,502] [INFO] [timer.py:215:stop] epoch=10/micro_step=650/global_step=9850, RunningAvgSamplesPerSec=85.11351564681618, CurrSamplesPerSec=84.99696375472305, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:35,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=9860, skipped=190, lr=[2.541434189083649e-06, 2.541434189083649e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:35,038] [INFO] [timer.py:215:stop] epoch=10/micro_step=660/global_step=9860, RunningAvgSamplesPerSec=85.11339629867203, CurrSamplesPerSec=84.90563750840562, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:42,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=9870, skipped=190, lr=[2.532368034889122e-06, 2.532368034889122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:42,570] [INFO] [timer.py:215:stop] epoch=10/micro_step=670/global_step=9870, RunningAvgSamplesPerSec=85.1133166914887, CurrSamplesPerSec=85.23633001281551, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:50,060] [INFO] [logging.py:96:log_dist] [Rank 0] step=9880, skipped=190, lr=[2.5233123235268985e-06, 
2.5233123235268985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:50,094] [INFO] [timer.py:215:stop] epoch=10/micro_step=680/global_step=9880, RunningAvgSamplesPerSec=85.11333916136944, CurrSamplesPerSec=85.05316895272315, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:57,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=9890, skipped=190, lr=[2.51426709624532e-06, 2.51426709624532e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:07:57,609] [INFO] [timer.py:215:stop] epoch=10/micro_step=690/global_step=9890, RunningAvgSamplesPerSec=85.1134522414776, CurrSamplesPerSec=85.43942209329491, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:07:59,810] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:08:00,508] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:08:04,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=9900, skipped=192, lr=[2.50703849064653e-06, 2.50703849064653e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:05,030] [INFO] [timer.py:215:stop] epoch=10/micro_step=700/global_step=9900, RunningAvgSamplesPerSec=85.11464289878731, CurrSamplesPerSec=85.00242750573628, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:12,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=9910, skipped=192, lr=[2.4980122385033927e-06, 2.4980122385033927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:12,570] [INFO] [timer.py:215:stop] epoch=10/micro_step=710/global_step=9910, RunningAvgSamplesPerSec=85.11448159881701, CurrSamplesPerSec=84.99599488698198, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:20,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=9920, skipped=192, lr=[2.4889965856816176e-06, 2.4889965856816176e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:20,101] [INFO] 
[timer.py:215:stop] epoch=10/micro_step=720/global_step=9920, RunningAvgSamplesPerSec=85.11441473881729, CurrSamplesPerSec=85.17069360209535, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:27,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=9930, skipped=192, lr=[2.47999157324708e-06, 2.47999157324708e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:27,633] [INFO] [timer.py:215:stop] epoch=10/micro_step=730/global_step=9930, RunningAvgSamplesPerSec=85.11434113349202, CurrSamplesPerSec=85.23771035517267, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=9940, skipped=192, lr=[2.470997242217201e-06, 2.470997242217201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:35,167] [INFO] [timer.py:215:stop] epoch=10/micro_step=740/global_step=9940, RunningAvgSamplesPerSec=85.11424582399196, CurrSamplesPerSec=84.96363125337088, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:42,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=9950, skipped=192, lr=[2.462013633560736e-06, 2.462013633560736e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:42,704] [INFO] [timer.py:215:stop] epoch=10/micro_step=750/global_step=9950, RunningAvgSamplesPerSec=85.11411601152757, CurrSamplesPerSec=84.54921002286055, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:50,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=9960, skipped=192, lr=[2.4530407881976083e-06, 2.4530407881976083e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:50,258] [INFO] [timer.py:215:stop] epoch=10/micro_step=760/global_step=9960, RunningAvgSamplesPerSec=85.1137866618527, CurrSamplesPerSec=84.88434644049852, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:08:57,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=9970, skipped=192, lr=[2.4440787469987114e-06, 2.4440787469987114e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:08:57,806] [INFO] [timer.py:215:stop] 
epoch=10/micro_step=770/global_step=9970, RunningAvgSamplesPerSec=85.11352793689174, CurrSamplesPerSec=84.84506789395515, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:05,313] [INFO] [logging.py:96:log_dist] [Rank 0] step=9980, skipped=192, lr=[2.4351275507857298e-06, 2.4351275507857298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:05,347] [INFO] [timer.py:215:stop] epoch=10/micro_step=780/global_step=9980, RunningAvgSamplesPerSec=85.11334577340982, CurrSamplesPerSec=85.24052530950385, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:12,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=9990, skipped=192, lr=[2.4261872403309417e-06, 2.4261872403309417e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:12,891] [INFO] [timer.py:215:stop] epoch=10/micro_step=790/global_step=9990, RunningAvgSamplesPerSec=85.1131409252602, CurrSamplesPerSec=84.92390317422856, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:16,602] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:09:17,299] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:09:20,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=10000, skipped=194, lr=[2.419042857080341e-06, 2.419042857080341e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:20,323] [INFO] [timer.py:215:stop] epoch=10/micro_step=800/global_step=10000, RunningAvgSamplesPerSec=85.11419385128755, CurrSamplesPerSec=84.45392709228676, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:27,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=10010, skipped=194, lr=[2.4101222435780758e-06, 2.4101222435780758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:27,881] [INFO] [timer.py:215:stop] epoch=10/micro_step=810/global_step=10010, RunningAvgSamplesPerSec=85.11382911742946, CurrSamplesPerSec=84.48478672140037, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:35,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=10020, skipped=194, lr=[2.401212629732021e-06, 2.401212629732021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:35,428] [INFO] [timer.py:215:stop] epoch=10/micro_step=820/global_step=10020, RunningAvgSamplesPerSec=85.11357346218573, CurrSamplesPerSec=84.77783687159307, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:42,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=10030, skipped=194, lr=[2.3923140561250565e-06, 2.3923140561250565e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:42,983] [INFO] [timer.py:215:stop] epoch=10/micro_step=830/global_step=10030, RunningAvgSamplesPerSec=85.1132426400825, CurrSamplesPerSec=84.9349469685488, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:09:50,499] [INFO] [logging.py:96:log_dist] [Rank 0] step=10040, skipped=194, lr=[2.383426563289774e-06, 2.383426563289774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:50,534] [INFO] [timer.py:215:stop] epoch=10/micro_step=840/global_step=10040, RunningAvgSamplesPerSec=85.11294681135094, CurrSamplesPerSec=84.84697195741775, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:09:58,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=10050, skipped=194, lr=[2.374550191708286e-06, 2.374550191708286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:09:58,084] [INFO] [timer.py:215:stop] epoch=10/micro_step=850/global_step=10050, RunningAvgSamplesPerSec=85.11266314379729, CurrSamplesPerSec=85.13833183261427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:05,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=10060, skipped=194, lr=[2.3656849818120608e-06, 2.3656849818120608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:05,632] [INFO] [timer.py:215:stop] epoch=10/micro_step=860/global_step=10060, RunningAvgSamplesPerSec=85.11241393221573, CurrSamplesPerSec=85.05723843204828, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:13,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=10070, skipped=194, lr=[2.356830973981711e-06, 2.356830973981711e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:13,187] [INFO] [timer.py:215:stop] epoch=10/micro_step=870/global_step=10070, RunningAvgSamplesPerSec=85.11207844988955, CurrSamplesPerSec=84.75832260048644, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:20,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=10080, skipped=194, lr=[2.3479882085468388e-06, 2.3479882085468388e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:20,725] [INFO] [timer.py:215:stop] epoch=10/micro_step=880/global_step=10080, RunningAvgSamplesPerSec=85.11193021705252, CurrSamplesPerSec=84.93677444259694, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:28,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=10090, skipped=194, lr=[2.3391567257858264e-06, 2.3391567257858264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:28,253] [INFO] [timer.py:215:stop] epoch=10/micro_step=890/global_step=10090, RunningAvgSamplesPerSec=85.11190676028909, CurrSamplesPerSec=84.82040324815443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:10:33,463] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:10:34,159] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:10:35,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=10100, skipped=196, lr=[2.332099690136468e-06, 2.332099690136468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:35,668] [INFO] [timer.py:215:stop] epoch=10/micro_step=900/global_step=10100, RunningAvgSamplesPerSec=85.11314582389134, CurrSamplesPerSec=85.02337390519942, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:43,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=10110, skipped=196, lr=[2.3232886175257783e-06, 2.3232886175257783e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:43,184] [INFO] [timer.py:215:stop] epoch=10/micro_step=910/global_step=10110, RunningAvgSamplesPerSec=85.1132420988413, CurrSamplesPerSec=85.23451668981942, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:10:50,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=10120, skipped=196, lr=[2.314488940094443e-06, 2.314488940094443e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:10:50,709] [INFO] [timer.py:215:stop] epoch=10/micro_step=920/global_step=10120, RunningAvgSamplesPerSec=85.11324906626204, CurrSamplesPerSec=84.74422116961885, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 11/16 *****
ppl: 1.7752763032913208
Beginning of Epoch 12/16, Total Micro Batches 920
[2023-06-29 19:11:16,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=10130, skipped=196, lr=[2.3057006979245793e-06, 2.3057006979245793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:16,140] [INFO] [timer.py:215:stop] epoch=11/micro_step=10/global_step=10130, RunningAvgSamplesPerSec=85.11287294647569, CurrSamplesPerSec=84.83056227270372, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:11:23,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=10140, skipped=196, lr=[2.2969239310462307e-06, 2.2969239310462307e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:23,670] [INFO] [timer.py:215:stop] epoch=11/micro_step=20/global_step=10140, RunningAvgSamplesPerSec=85.11282179738701, CurrSamplesPerSec=85.17742296808554, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:11:31,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=10150, skipped=196, lr=[2.288158679437157e-06, 2.288158679437157e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:31,212] [INFO] [timer.py:215:stop] epoch=11/micro_step=30/global_step=10150, RunningAvgSamplesPerSec=85.11263688764159, CurrSamplesPerSec=84.74842167441899, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:11:38,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=10160, skipped=196, lr=[2.2794049830226756e-06, 2.2794049830226756e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:38,761] [INFO] [timer.py:215:stop] epoch=11/micro_step=40/global_step=10160, RunningAvgSamplesPerSec=85.1123854823247, CurrSamplesPerSec=84.82672889842519, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:11:46,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=10170, skipped=196, lr=[2.2706628816754673e-06, 2.2706628816754673e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:46,318] [INFO] [timer.py:215:stop] epoch=11/micro_step=50/global_step=10170, RunningAvgSamplesPerSec=85.1120291227843, CurrSamplesPerSec=84.68332053576893, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:11:53,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=10180, skipped=196, lr=[2.2619324152154e-06, 2.2619324152154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:11:53,861] [INFO] [timer.py:215:stop] epoch=11/micro_step=60/global_step=10180, RunningAvgSamplesPerSec=85.11182509259395, CurrSamplesPerSec=84.84774969806219, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:12:01,376] [INFO] [logging.py:96:log_dist] [Rank 0] step=10190, skipped=196, lr=[2.2532136234093383e-06, 2.2532136234093383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:01,410] [INFO] [timer.py:215:stop] epoch=11/micro_step=70/global_step=10190, RunningAvgSamplesPerSec=85.11156215205571, CurrSamplesPerSec=84.88166232248432, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:08,138] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:12:08,838] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:12:08,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=10200, skipped=198, lr=[2.246247022404771e-06, 2.246247022404771e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:08,840] [INFO] [timer.py:215:stop] epoch=11/micro_step=80/global_step=10200, RunningAvgSamplesPerSec=85.11262991337631, CurrSamplesPerSec=91.40674100372048, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:16,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=10210, skipped=198, lr=[2.2375493450185842e-06, 2.2375493450185842e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:16,402] [INFO] [timer.py:215:stop] epoch=11/micro_step=90/global_step=10210, RunningAvgSamplesPerSec=85.11221409254186, CurrSamplesPerSec=84.72023009045625, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:23,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=10220, skipped=198, lr=[2.228863453350159e-06, 2.228863453350159e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:23,963] [INFO] [timer.py:215:stop] epoch=11/micro_step=100/global_step=10220, RunningAvgSamplesPerSec=85.11181873630332, CurrSamplesPerSec=84.83158098987022, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:31,479] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=10230, skipped=198, lr=[2.2201893869633316e-06, 2.2201893869633316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:31,513] [INFO] [timer.py:215:stop] epoch=11/micro_step=110/global_step=10230, RunningAvgSamplesPerSec=85.1115459652564, CurrSamplesPerSec=84.9213508810059, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:39,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=10240, skipped=198, lr=[2.2115271853680737e-06, 2.2115271853680737e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:39,064] [INFO] [timer.py:215:stop] epoch=11/micro_step=120/global_step=10240, RunningAvgSamplesPerSec=85.11126216538074, CurrSamplesPerSec=84.66668033434506, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:46,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=10250, skipped=198, lr=[2.202876888020306e-06, 2.202876888020306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:46,611] [INFO] [timer.py:215:stop] epoch=11/micro_step=130/global_step=10250, RunningAvgSamplesPerSec=85.11102357594157, CurrSamplesPerSec=85.09433984992533, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:12:54,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=10260, skipped=198, lr=[2.1942385343217394e-06, 2.1942385343217394e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:12:54,162] [INFO] [timer.py:215:stop] epoch=11/micro_step=140/global_step=10260, RunningAvgSamplesPerSec=85.11074805014218, CurrSamplesPerSec=84.85340886101558, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:01,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=10270, skipped=198, lr=[2.1856121636196695e-06, 2.1856121636196695e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:01,699] [INFO] [timer.py:215:stop] epoch=11/micro_step=150/global_step=10270, RunningAvgSamplesPerSec=85.11061237545289, CurrSamplesPerSec=84.93707007078842, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:09,218] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=10280, skipped=198, lr=[2.176997815206816e-06, 2.176997815206816e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:09,252] [INFO] [timer.py:215:stop] epoch=11/micro_step=160/global_step=10280, RunningAvgSamplesPerSec=85.11030782342462, CurrSamplesPerSec=84.62320096994885, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:16,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=10290, skipped=198, lr=[2.1683955283211374e-06, 2.1683955283211374e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:16,801] [INFO] [timer.py:215:stop] epoch=11/micro_step=170/global_step=10290, RunningAvgSamplesPerSec=85.11005297256305, CurrSamplesPerSec=84.52985408560005, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:24,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=10300, skipped=198, lr=[2.159805342145652e-06, 2.159805342145652e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:24,362] [INFO] [timer.py:215:stop] epoch=11/micro_step=180/global_step=10300, RunningAvgSamplesPerSec=85.10967097904464, CurrSamplesPerSec=84.83316273079932, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:25,058] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:13:25,754] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:13:31,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=10310, skipped=200, lr=[2.1529419320124055e-06, 2.1529419320124055e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:31,791] [INFO] [timer.py:215:stop] epoch=11/micro_step=190/global_step=10310, RunningAvgSamplesPerSec=85.11072834553296, CurrSamplesPerSec=85.14297658465073, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:39,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=10320, skipped=200, lr=[2.144373625680421e-06, 2.144373625680421e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:39,334] [INFO] [timer.py:215:stop] epoch=11/micro_step=200/global_step=10320, RunningAvgSamplesPerSec=85.11053694722104, CurrSamplesPerSec=84.75503095486475, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:46,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=10330, skipped=200, lr=[2.1358175294772792e-06, 2.1358175294772792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:46,886] [INFO] [timer.py:215:stop] epoch=11/micro_step=210/global_step=10330, RunningAvgSamplesPerSec=85.11023789633506, CurrSamplesPerSec=84.62184045846118, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:13:54,400] [INFO] [logging.py:96:log_dist] [Rank 0] step=10340, skipped=200, lr=[2.127273682375604e-06, 2.127273682375604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:13:54,433] [INFO] [timer.py:215:stop] epoch=11/micro_step=220/global_step=10340, RunningAvgSamplesPerSec=85.11001063149718, CurrSamplesPerSec=84.93580694654023, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:01,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=10350, skipped=200, lr=[2.1187421232922227e-06, 2.1187421232922227e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:01,962] [INFO] [timer.py:215:stop] epoch=11/micro_step=230/global_step=10350, RunningAvgSamplesPerSec=85.10997650317377, CurrSamplesPerSec=84.94118220356548, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:14:09,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=10360, skipped=200, lr=[2.1102228910879934e-06, 2.1102228910879934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:09,491] [INFO] [timer.py:215:stop] epoch=11/micro_step=240/global_step=10360, RunningAvgSamplesPerSec=85.10993941350682, CurrSamplesPerSec=85.0965788339124, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:16,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=10370, skipped=200, lr=[2.101716024567618e-06, 2.101716024567618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:17,029] [INFO] [timer.py:215:stop] epoch=11/micro_step=250/global_step=10370, RunningAvgSamplesPerSec=85.10980001545023, CurrSamplesPerSec=85.26364744898753, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:24,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=10380, skipped=200, lr=[2.093221562479486e-06, 2.093221562479486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:24,572] [INFO] [timer.py:215:stop] epoch=11/micro_step=260/global_step=10380, RunningAvgSamplesPerSec=85.10961093957452, CurrSamplesPerSec=85.01281864511773, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:32,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=10390, skipped=200, lr=[2.084739543515474e-06, 2.084739543515474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:32,112] [INFO] [timer.py:215:stop] epoch=11/micro_step=270/global_step=10390, RunningAvgSamplesPerSec=85.10946217550469, CurrSamplesPerSec=84.82048365304283, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:39,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=10400, skipped=200, lr=[2.0762700063107855e-06, 2.0762700063107855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:39,642] [INFO] [timer.py:215:stop] epoch=11/micro_step=280/global_step=10400, RunningAvgSamplesPerSec=85.10941731266018, CurrSamplesPerSec=84.66582579752465, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:14:41,846] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:14:42,541] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:14:47,024] [INFO] [logging.py:96:log_dist] [Rank 0] step=10410, skipped=202, lr=[2.0695033893403147e-06, 2.0695033893403147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:47,057] [INFO] [timer.py:215:stop] epoch=11/micro_step=290/global_step=10410, RunningAvgSamplesPerSec=85.11060834863906, CurrSamplesPerSec=85.30456157054064, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:14:54,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=10420, skipped=202, lr=[2.0610564164815326e-06, 2.0610564164815326e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:14:54,588] [INFO] [timer.py:215:stop] epoch=11/micro_step=300/global_step=10420, RunningAvgSamplesPerSec=85.11054998548296, CurrSamplesPerSec=84.98937488186486, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:02,082] [INFO] [logging.py:96:log_dist] [Rank 0] step=10430, skipped=202, lr=[2.05262203325762e-06, 2.05262203325762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:02,116] [INFO] [timer.py:215:stop] epoch=11/micro_step=310/global_step=10430, RunningAvgSamplesPerSec=85.11052945107402, CurrSamplesPerSec=85.04349539625066, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:09,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=10440, skipped=202, lr=[2.0442002780868037e-06, 2.0442002780868037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:09,649] [INFO] [timer.py:215:stop] epoch=11/micro_step=320/global_step=10440, RunningAvgSamplesPerSec=85.11043916618411, CurrSamplesPerSec=85.11673499310183, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:17,143] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=10450, skipped=202, lr=[2.035791189329784e-06, 2.035791189329784e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:17,180] [INFO] [timer.py:215:stop] epoch=11/micro_step=330/global_step=10450, RunningAvgSamplesPerSec=85.11037498827015, CurrSamplesPerSec=84.7316221531931, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:24,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=10460, skipped=202, lr=[2.027394805289572e-06, 2.027394805289572e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:24,716] [INFO] [timer.py:215:stop] epoch=11/micro_step=340/global_step=10460, RunningAvgSamplesPerSec=85.11026679918626, CurrSamplesPerSec=85.17282850666443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:32,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=10470, skipped=202, lr=[2.019011164211309e-06, 2.019011164211309e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:32,255] [INFO] [timer.py:215:stop] epoch=11/micro_step=350/global_step=10470, RunningAvgSamplesPerSec=85.11012936762663, CurrSamplesPerSec=85.00821500310822, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:39,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=10480, skipped=202, lr=[2.010640304282091e-06, 2.010640304282091e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:39,788] [INFO] [timer.py:215:stop] epoch=11/micro_step=360/global_step=10480, RunningAvgSamplesPerSec=85.11005590370242, CurrSamplesPerSec=84.78498632217261, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:47,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=10490, skipped=202, lr=[2.002282263630794e-06, 2.002282263630794e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:47,320] [INFO] [timer.py:215:stop] epoch=11/micro_step=370/global_step=10490, RunningAvgSamplesPerSec=85.10998259543366, CurrSamplesPerSec=84.9908817842435, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:54,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=10500, skipped=202, 
lr=[1.9939370803279105e-06, 1.9939370803279105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:15:54,852] [INFO] [timer.py:215:stop] epoch=11/micro_step=380/global_step=10500, RunningAvgSamplesPerSec=85.10990845788224, CurrSamplesPerSec=85.06279080010889, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:15:58,556] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:15:59,255] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:16:02,237] [INFO] [logging.py:96:log_dist] [Rank 0] step=10510, skipped=204, lr=[1.9872702165224435e-06, 1.9872702165224435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:02,271] [INFO] [timer.py:215:stop] epoch=11/micro_step=390/global_step=10510, RunningAvgSamplesPerSec=85.11105874742587, CurrSamplesPerSec=84.80829065307975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:09,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=10520, skipped=204, lr=[1.978948272197276e-06, 1.978948272197276e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:09,801] [INFO] [timer.py:215:stop] epoch=11/micro_step=400/global_step=10520, RunningAvgSamplesPerSec=85.11100642929482, CurrSamplesPerSec=84.93454385986006, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:17,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=10530, skipped=204, lr=[1.9706392915057724e-06, 1.9706392915057724e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:17,326] [INFO] [timer.py:215:stop] epoch=11/micro_step=410/global_step=10530, RunningAvgSamplesPerSec=85.11102071842767, CurrSamplesPerSec=85.35132215073456, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:24,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=10540, skipped=204, lr=[1.962343312294949e-06, 1.962343312294949e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 19:16:24,864] [INFO] [timer.py:215:stop] epoch=11/micro_step=420/global_step=10540, RunningAvgSamplesPerSec=85.11087711838559, CurrSamplesPerSec=85.02186585042942, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:32,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=10550, skipped=204, lr=[1.9540603723526074e-06, 1.9540603723526074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:32,403] [INFO] [timer.py:215:stop] epoch=11/micro_step=430/global_step=10550, RunningAvgSamplesPerSec=85.11073461489215, CurrSamplesPerSec=85.27152917019271, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:39,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=10560, skipped=204, lr=[1.945790509407158e-06, 1.945790509407158e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:39,939] [INFO] [timer.py:215:stop] epoch=11/micro_step=440/global_step=10560, RunningAvgSamplesPerSec=85.11062695819379, CurrSamplesPerSec=84.78056798004445, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:47,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=10570, skipped=204, lr=[1.937533761127437e-06, 1.937533761127437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:47,480] [INFO] [timer.py:215:stop] epoch=11/micro_step=450/global_step=10570, RunningAvgSamplesPerSec=85.11047443391735, CurrSamplesPerSec=84.9656213224146, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:16:54,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=10580, skipped=204, lr=[1.929290165122557e-06, 1.929290165122557e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:16:55,029] [INFO] [timer.py:215:stop] epoch=11/micro_step=460/global_step=10580, RunningAvgSamplesPerSec=85.110223424108, CurrSamplesPerSec=84.71405398513724, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:02,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=10590, skipped=204, lr=[1.9210597589417105e-06, 1.9210597589417105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
19:17:02,558] [INFO] [timer.py:215:stop] epoch=11/micro_step=470/global_step=10590, RunningAvgSamplesPerSec=85.11018901698708, CurrSamplesPerSec=85.14964755112904, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:10,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=10600, skipped=204, lr=[1.912842580074018e-06, 1.912842580074018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:10,098] [INFO] [timer.py:215:stop] epoch=11/micro_step=480/global_step=10600, RunningAvgSamplesPerSec=85.11003069359356, CurrSamplesPerSec=84.84828607922785, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:15,312] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:17:16,012] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:17:17,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=10610, skipped=206, lr=[1.9062783857995639e-06, 1.9062783857995639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:17,529] [INFO] [timer.py:215:stop] epoch=11/micro_step=490/global_step=10610, RunningAvgSamplesPerSec=85.11104066531189, CurrSamplesPerSec=84.64983311931906, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:25,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=10620, skipped=206, lr=[1.8980851103757003e-06, 1.8980851103757003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:25,065] [INFO] [timer.py:215:stop] epoch=11/micro_step=500/global_step=10620, RunningAvgSamplesPerSec=85.1109403399164, CurrSamplesPerSec=85.0309149815833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:32,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=10630, skipped=206, lr=[1.88990516691345e-06, 1.88990516691345e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:32,607] [INFO] [timer.py:215:stop] 
epoch=11/micro_step=510/global_step=10630, RunningAvgSamplesPerSec=85.1107699469507, CurrSamplesPerSec=84.47957541717295, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:40,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=10640, skipped=206, lr=[1.8817385926720774e-06, 1.8817385926720774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:40,153] [INFO] [timer.py:215:stop] epoch=11/micro_step=520/global_step=10640, RunningAvgSamplesPerSec=85.1105548513218, CurrSamplesPerSec=84.93062046848681, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:47,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=10650, skipped=206, lr=[1.873585424849946e-06, 1.873585424849946e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:47,700] [INFO] [timer.py:215:stop] epoch=11/micro_step=530/global_step=10650, RunningAvgSamplesPerSec=85.11032438988929, CurrSamplesPerSec=84.74162616137359, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:17:55,205] [INFO] [logging.py:96:log_dist] [Rank 0] step=10660, skipped=206, lr=[1.8654457005843584e-06, 1.8654457005843584e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:17:55,238] [INFO] [timer.py:215:stop] epoch=11/micro_step=540/global_step=10660, RunningAvgSamplesPerSec=85.11018478907265, CurrSamplesPerSec=84.77349959039975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:02,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=10670, skipped=206, lr=[1.8573194569513824e-06, 1.8573194569513824e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:02,781] [INFO] [timer.py:215:stop] epoch=11/micro_step=550/global_step=10670, RunningAvgSamplesPerSec=85.11000519055074, CurrSamplesPerSec=85.12027072534835, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:10,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=10680, skipped=206, lr=[1.849206730965682e-06, 1.849206730965682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:10,308] [INFO] [timer.py:215:stop] 
epoch=11/micro_step=560/global_step=10680, RunningAvgSamplesPerSec=85.11000033011389, CurrSamplesPerSec=84.91345317708031, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:17,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=10690, skipped=206, lr=[1.8411075595803423e-06, 1.8411075595803423e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:17,840] [INFO] [timer.py:215:stop] epoch=11/micro_step=570/global_step=10690, RunningAvgSamplesPerSec=85.10993071550085, CurrSamplesPerSec=84.78043409807775, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:25,332] [INFO] [logging.py:96:log_dist] [Rank 0] step=10700, skipped=206, lr=[1.8330219796867217e-06, 1.8330219796867217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:25,366] [INFO] [timer.py:215:stop] epoch=11/micro_step=580/global_step=10700, RunningAvgSamplesPerSec=85.10992400664364, CurrSamplesPerSec=85.02426260538337, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:32,093] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:18:32,792] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:18:32,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=10710, skipped=208, lr=[1.8265633263973277e-06, 1.8265633263973277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:32,794] [INFO] [timer.py:215:stop] epoch=11/micro_step=590/global_step=10710, RunningAvgSamplesPerSec=85.11095748322549, CurrSamplesPerSec=91.60225303588601, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:40,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=10720, skipped=208, lr=[1.818502303957273e-06, 1.818502303957273e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:40,338] [INFO] [timer.py:215:stop] epoch=11/micro_step=600/global_step=10720, RunningAvgSamplesPerSec=85.11076103868386, CurrSamplesPerSec=85.11249789941945, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:47,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=10730, skipped=208, lr=[1.8104549759748275e-06, 1.8104549759748275e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:47,878] [INFO] [timer.py:215:stop] epoch=11/micro_step=610/global_step=10730, RunningAvgSamplesPerSec=85.11061622027346, CurrSamplesPerSec=85.21268167997913, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:18:55,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=10740, skipped=208, lr=[1.8024213791051924e-06, 1.8024213791051924e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:18:55,408] [INFO] [timer.py:215:stop] epoch=11/micro_step=620/global_step=10740, RunningAvgSamplesPerSec=85.11057033239003, CurrSamplesPerSec=85.11956895359884, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:02,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=10750, skipped=208, lr=[1.7944015499410302e-06, 1.7944015499410302e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:02,943] [INFO] [timer.py:215:stop] epoch=11/micro_step=630/global_step=10750, RunningAvgSamplesPerSec=85.11047509883652, CurrSamplesPerSec=84.92916944295216, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:10,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=10760, skipped=208, lr=[1.7863955250122931e-06, 1.7863955250122931e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:10,470] [INFO] [timer.py:215:stop] epoch=11/micro_step=640/global_step=10760, RunningAvgSamplesPerSec=85.11045536680624, CurrSamplesPerSec=85.26974143555711, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:17,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=10770, skipped=208, lr=[1.77840334078605e-06, 1.77840334078605e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:17,996] [INFO] [timer.py:215:stop] epoch=11/micro_step=650/global_step=10770, RunningAvgSamplesPerSec=85.11045461972506, CurrSamplesPerSec=84.97100035389356, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:25,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=10780, skipped=208, lr=[1.7704250336663302e-06, 1.7704250336663302e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:25,525] [INFO] [timer.py:215:stop] epoch=11/micro_step=660/global_step=10780, RunningAvgSamplesPerSec=85.11042041417544, CurrSamplesPerSec=84.96263625380722, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:33,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=10790, skipped=208, lr=[1.7624606399939543e-06, 1.7624606399939543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:33,059] [INFO] [timer.py:215:stop] epoch=11/micro_step=670/global_step=10790, RunningAvgSamplesPerSec=85.11032772668403, CurrSamplesPerSec=85.12531842462853, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:40,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=10800, skipped=208, lr=[1.7545101960463666e-06, 1.7545101960463666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:40,595] [INFO] [timer.py:215:stop] epoch=11/micro_step=680/global_step=10800, RunningAvgSamplesPerSec=85.11022931823601, CurrSamplesPerSec=85.06187433989898, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:19:48,094] [INFO] [logging.py:96:log_dist] [Rank 0] step=10810, skipped=208, lr=[1.7465737380374663e-06, 1.7465737380374663e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:48,128] [INFO] [timer.py:215:stop] epoch=11/micro_step=690/global_step=10810, RunningAvgSamplesPerSec=85.11015092141358, CurrSamplesPerSec=85.1641274145949, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:19:48,825] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:19:49,521] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:19:55,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=10820, skipped=210, lr=[1.740234665801283e-06, 1.740234665801283e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:19:55,553] [INFO] [timer.py:215:stop] epoch=11/micro_step=700/global_step=10820, RunningAvgSamplesPerSec=85.11119567159584, CurrSamplesPerSec=84.93795696771043, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:03,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=10830, skipped=210, lr=[1.7323234735376135e-06, 1.7323234735376135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:03,096] [INFO] [timer.py:215:stop] epoch=11/micro_step=710/global_step=10830, RunningAvgSamplesPerSec=85.11101541404496, CurrSamplesPerSec=85.03614065476782, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:10,602] [INFO] [logging.py:96:log_dist] [Rank 0] step=10840, skipped=210, lr=[1.7244263682721251e-06, 1.7244263682721251e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:10,635] [INFO] [timer.py:215:stop] epoch=11/micro_step=720/global_step=10840, RunningAvgSamplesPerSec=85.11087903565615, CurrSamplesPerSec=84.86438066827247, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:18,144] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=10850, skipped=210, lr=[1.716543385975768e-06, 1.716543385975768e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:18,177] [INFO] [timer.py:215:stop] epoch=11/micro_step=730/global_step=10850, RunningAvgSamplesPerSec=85.1107049186453, CurrSamplesPerSec=84.75503095486475, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:25,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=10860, skipped=210, lr=[1.7086745625551605e-06, 1.7086745625551605e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:25,712] [INFO] [timer.py:215:stop] epoch=11/micro_step=740/global_step=10860, RunningAvgSamplesPerSec=85.1106092705583, CurrSamplesPerSec=84.91793910758099, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:33,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=10870, skipped=210, lr=[1.7008199338524288e-06, 1.7008199338524288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:33,244] [INFO] [timer.py:215:stop] epoch=11/micro_step=750/global_step=10870, RunningAvgSamplesPerSec=85.1105378639572, CurrSamplesPerSec=84.87098121481631, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:40,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=10880, skipped=210, lr=[1.692979535645045e-06, 1.692979535645045e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:40,788] [INFO] [timer.py:215:stop] epoch=11/micro_step=760/global_step=10880, RunningAvgSamplesPerSec=85.1103450239393, CurrSamplesPerSec=84.61826598968386, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:48,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=10890, skipped=210, lr=[1.6851534036456545e-06, 1.6851534036456545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:48,329] [INFO] [timer.py:215:stop] epoch=11/micro_step=770/global_step=10890, RunningAvgSamplesPerSec=85.11018580437184, CurrSamplesPerSec=85.00100094141057, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:20:55,835] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=10900, skipped=210, lr=[1.6773415735019288e-06, 1.6773415735019288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:20:55,869] [INFO] [timer.py:215:stop] epoch=11/micro_step=780/global_step=10900, RunningAvgSamplesPerSec=85.1100334269825, CurrSamplesPerSec=85.1653973389646, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:03,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=10910, skipped=210, lr=[1.6695440807963904e-06, 1.6695440807963904e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:03,410] [INFO] [timer.py:215:stop] epoch=11/micro_step=790/global_step=10910, RunningAvgSamplesPerSec=85.10987317321106, CurrSamplesPerSec=85.07996460329575, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:05,618] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:21:06,316] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:21:10,812] [INFO] [logging.py:96:log_dist] [Rank 0] step=10920, skipped=212, lr=[1.6633164334572254e-06, 1.6633164334572254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:10,846] [INFO] [timer.py:215:stop] epoch=11/micro_step=800/global_step=10920, RunningAvgSamplesPerSec=85.11080519296948, CurrSamplesPerSec=84.60029152359687, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:18,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=10930, skipped=212, lr=[1.655544837599826e-06, 1.655544837599826e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:18,397] [INFO] [timer.py:215:stop] epoch=11/micro_step=810/global_step=10930, RunningAvgSamplesPerSec=85.11053704143809, CurrSamplesPerSec=84.7933690488224, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:25,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=10940, skipped=212, lr=[1.6477876784637358e-06, 1.6477876784637358e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:25,930] [INFO] [timer.py:215:stop] epoch=11/micro_step=820/global_step=10940, RunningAvgSamplesPerSec=85.11046036767146, CurrSamplesPerSec=85.12286198282166, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:33,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=10950, skipped=212, lr=[1.6400449913824576e-06, 1.6400449913824576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:33,466] [INFO] [timer.py:215:stop] epoch=11/micro_step=830/global_step=10950, RunningAvgSamplesPerSec=85.11036110573919, CurrSamplesPerSec=84.82198457227433, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:40,964] [INFO] [logging.py:96:log_dist] [Rank 0] step=10960, skipped=212, lr=[1.6323168116235678e-06, 1.6323168116235678e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:40,998] [INFO] [timer.py:215:stop] epoch=11/micro_step=840/global_step=10960, RunningAvgSamplesPerSec=85.11029668375694, CurrSamplesPerSec=84.88778235927138, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:21:48,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=10970, skipped=212, lr=[1.6246031743885691e-06, 1.6246031743885691e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:48,539] [INFO] [timer.py:215:stop] epoch=11/micro_step=850/global_step=10970, RunningAvgSamplesPerSec=85.1101347086309, CurrSamplesPerSec=84.49699326127904, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:21:56,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=10980, skipped=212, lr=[1.6169041148127212e-06, 1.6169041148127212e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:21:56,080] [INFO] [timer.py:215:stop] epoch=11/micro_step=860/global_step=10980, RunningAvgSamplesPerSec=85.10997640662603, CurrSamplesPerSec=85.09957332195441, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:22:03,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=10990, skipped=212, lr=[1.6092196679648839e-06, 1.6092196679648839e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:22:03,610] [INFO] [timer.py:215:stop] epoch=11/micro_step=870/global_step=10990, RunningAvgSamplesPerSec=85.10992981268258, CurrSamplesPerSec=84.77802429498098, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:22:11,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=11000, skipped=212, lr=[1.6015498688473575e-06, 1.6015498688473575e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:22:11,157] [INFO] [timer.py:215:stop] epoch=11/micro_step=880/global_step=11000, RunningAvgSamplesPerSec=85.10971446325611, CurrSamplesPerSec=84.7613200937051, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:22:18,661] [INFO] [logging.py:96:log_dist] [Rank 0] step=11010, skipped=212, lr=[1.5938947523957166e-06, 1.5938947523957166e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:22:18,694] [INFO] [timer.py:215:stop] epoch=11/micro_step=890/global_step=11010, RunningAvgSamplesPerSec=85.10959266545792, CurrSamplesPerSec=84.837961932887, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:22:22,407] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-06-29 19:22:23,102] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-06-29 19:22:26,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=11020, skipped=214, lr=[1.5877812541878633e-06, 1.5877812541878633e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 19:22:26,114] [INFO] [timer.py:215:stop] epoch=11/micro_step=900/global_step=11020, RunningAvgSamplesPerSec=85.11067508244273, CurrSamplesPerSec=84.97785960688053, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 19:22:33,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=11030, skipped=214, lr=[1.5801526543589066e-06, 1.5801526543589066e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 19:22:33,638] [INFO] [timer.py:215:stop] epoch=11/micro_step=910/global_step=11030, RunningAvgSamplesPerSec=85.11068894821427, CurrSamplesPerSec=85.03385097663556, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 19:22:41,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=11040, skipped=214, lr=[1.572538834659157e-06, 1.572538834659157e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 19:22:41,165] [INFO] [timer.py:215:stop] epoch=11/micro_step=920/global_step=11040, RunningAvgSamplesPerSec=85.11067276709626, CurrSamplesPerSec=84.50704835469132, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 12/16 *****
ppl: 1.7762515544891357
Beginning of Epoch 13/16, Total Micro Batches 920
[2023-06-29 19:23:06,564] [INFO] [logging.py:96:log_dist] [Rank 0] step=11050, skipped=214, lr=[1.5649398297692118e-06, 1.5649398297692118e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 19:23:06,597] [INFO] [timer.py:215:stop] epoch=12/micro_step=10/global_step=11050, RunningAvgSamplesPerSec=85.11037495054546,
CurrSamplesPerSec=84.8624221670317, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:14,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=11060, skipped=214, lr=[1.5573556743021863e-06, 1.5573556743021863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:14,138] [INFO] [timer.py:215:stop] epoch=12/micro_step=20/global_step=11060, RunningAvgSamplesPerSec=85.11021397129016, CurrSamplesPerSec=85.09976217060692, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:21,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=11070, skipped=214, lr=[1.549786402803556e-06, 1.549786402803556e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:21,679] [INFO] [timer.py:215:stop] epoch=12/micro_step=30/global_step=11070, RunningAvgSamplesPerSec=85.11006432084366, CurrSamplesPerSec=84.71592543710989, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:29,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=11080, skipped=214, lr=[1.5422320497510037e-06, 1.5422320497510037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:29,221] [INFO] [timer.py:215:stop] epoch=12/micro_step=40/global_step=11080, RunningAvgSamplesPerSec=85.10988804879405, CurrSamplesPerSec=84.89760847447067, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:36,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=11090, skipped=214, lr=[1.5346926495542545e-06, 1.5346926495542545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:36,771] [INFO] [timer.py:215:stop] epoch=12/micro_step=50/global_step=11090, RunningAvgSamplesPerSec=85.10964200035986, CurrSamplesPerSec=85.18242338640454, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:44,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=11100, skipped=214, lr=[1.5271682365549279e-06, 1.5271682365549279e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:44,295] [INFO] [timer.py:215:stop] epoch=12/micro_step=60/global_step=11100, RunningAvgSamplesPerSec=85.10966293139249, 
CurrSamplesPerSec=85.11821942506495, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:51,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=11110, skipped=214, lr=[1.5196588450263763e-06, 1.5196588450263763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:51,824] [INFO] [timer.py:215:stop] epoch=12/micro_step=70/global_step=11110, RunningAvgSamplesPerSec=85.10961870968349, CurrSamplesPerSec=85.28548149273088, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:23:57,038] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:23:57,735] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:23:59,205] [INFO] [logging.py:96:log_dist] [Rank 0] step=11120, skipped=216, lr=[1.5136621702505102e-06, 1.5136621702505102e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:23:59,238] [INFO] [timer.py:215:stop] epoch=12/micro_step=80/global_step=11120, RunningAvgSamplesPerSec=85.11075556212076, CurrSamplesPerSec=85.2557942621282, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:06,739] [INFO] [logging.py:96:log_dist] [Rank 0] step=11130, skipped=216, lr=[1.5061799035196989e-06, 1.5061799035196989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:06,772] [INFO] [timer.py:215:stop] epoch=12/micro_step=90/global_step=11130, RunningAvgSamplesPerSec=85.1106687487447, CurrSamplesPerSec=85.03177690406109, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:14,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=11140, skipped=216, lr=[1.4987127538605462e-06, 1.4987127538605462e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:14,309] [INFO] [timer.py:215:stop] epoch=12/micro_step=100/global_step=11140, RunningAvgSamplesPerSec=85.11055325026703, CurrSamplesPerSec=85.06532464581659, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:24:21,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=11150, skipped=216, lr=[1.491260755285575e-06, 1.491260755285575e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:21,847] [INFO] [timer.py:215:stop] epoch=12/micro_step=110/global_step=11150, RunningAvgSamplesPerSec=85.11042392690858, CurrSamplesPerSec=84.88238701761931, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:29,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=11160, skipped=216, lr=[1.4838239417382894e-06, 1.4838239417382894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:29,383] [INFO] [timer.py:215:stop] epoch=12/micro_step=120/global_step=11160, RunningAvgSamplesPerSec=85.11032588871538, CurrSamplesPerSec=84.83849819030895, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:36,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=11170, skipped=216, lr=[1.4764023470930319e-06, 1.4764023470930319e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:36,906] [INFO] [timer.py:215:stop] epoch=12/micro_step=130/global_step=11170, RunningAvgSamplesPerSec=85.1103494448303, CurrSamplesPerSec=85.27169169524035, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:44,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=11180, skipped=216, lr=[1.4689960051548237e-06, 1.4689960051548237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:44,437] [INFO] [timer.py:215:stop] epoch=12/micro_step=140/global_step=11180, RunningAvgSamplesPerSec=85.11029394675978, CurrSamplesPerSec=85.22114919699835, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:24:51,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=11190, skipped=216, lr=[1.4616049496592044e-06, 1.4616049496592044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:51,970] [INFO] [timer.py:215:stop] epoch=12/micro_step=150/global_step=11190, RunningAvgSamplesPerSec=85.11021637959273, CurrSamplesPerSec=85.21746981353998, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:24:59,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=11200, skipped=216, lr=[1.4542292142720952e-06, 1.4542292142720952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:24:59,506] [INFO] [timer.py:215:stop] epoch=12/micro_step=160/global_step=11200, RunningAvgSamplesPerSec=85.11011884849916, CurrSamplesPerSec=84.89476242568341, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:07,003] [INFO] [logging.py:96:log_dist] [Rank 0] step=11210, skipped=216, lr=[1.446868832589624e-06, 1.446868832589624e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:07,037] [INFO] [timer.py:215:stop] epoch=12/micro_step=170/global_step=11210, RunningAvgSamplesPerSec=85.11005933753805, CurrSamplesPerSec=85.00514618542292, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:13,751] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:25:14,447] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:25:14,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=11220, skipped=218, lr=[1.4409916044429452e-06, 1.4409916044429452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:14,448] [INFO] [timer.py:215:stop] epoch=12/micro_step=180/global_step=11220, RunningAvgSamplesPerSec=85.11121651394399, CurrSamplesPerSec=91.99277040554406, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:21,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=11230, skipped=218, lr=[1.4336589438677505e-06, 1.4336589438677505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:21,984] [INFO] [timer.py:215:stop] epoch=12/micro_step=190/global_step=11230, RunningAvgSamplesPerSec=85.11111334406698, CurrSamplesPerSec=85.32045007903177, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:29,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=11240, skipped=218, lr=[1.4263417306938212e-06, 1.4263417306938212e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:29,517] [INFO] [timer.py:215:stop] epoch=12/micro_step=200/global_step=11240, RunningAvgSamplesPerSec=85.11103281001103, CurrSamplesPerSec=84.64775104487063, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:37,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=11250, skipped=218, lr=[1.4190399982507265e-06, 1.4190399982507265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:37,061] [INFO] [timer.py:215:stop] epoch=12/micro_step=210/global_step=11250, RunningAvgSamplesPerSec=85.11084601994722, CurrSamplesPerSec=84.84764242264286, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:44,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=11260, skipped=218, lr=[1.4117537797975187e-06, 1.4117537797975187e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:44,586] [INFO] [timer.py:215:stop] epoch=12/micro_step=220/global_step=11260, RunningAvgSamplesPerSec=85.11085658384275, CurrSamplesPerSec=85.39500982358271, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:52,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=11270, skipped=218, lr=[1.4044831085225871e-06, 1.4044831085225871e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:52,119] [INFO] [timer.py:215:stop] epoch=12/micro_step=230/global_step=11270, RunningAvgSamplesPerSec=85.11079418347772, CurrSamplesPerSec=85.10982631835431, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:25:59,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=11280, skipped=218, lr=[1.3972280175434985e-06, 1.3972280175434985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:25:59,657] [INFO] [timer.py:215:stop] epoch=12/micro_step=240/global_step=11280, RunningAvgSamplesPerSec=85.11066650254489, CurrSamplesPerSec=84.96823006762692, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:07,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=11290, skipped=218, lr=[1.3899885399068579e-06, 1.3899885399068579e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:07,191] [INFO] [timer.py:215:stop] epoch=12/micro_step=250/global_step=11290, RunningAvgSamplesPerSec=85.11058128564095, CurrSamplesPerSec=85.2869447191828, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:14,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=11300, skipped=218, lr=[1.3827647085881517e-06, 1.3827647085881517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:14,717] [INFO] [timer.py:215:stop] epoch=12/micro_step=260/global_step=11300, RunningAvgSamplesPerSec=85.11057268757732, CurrSamplesPerSec=84.81407854123887, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:22,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=11310, skipped=218, lr=[1.3755565564915916e-06, 1.3755565564915916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:22,246] [INFO] [timer.py:215:stop] epoch=12/micro_step=270/global_step=11310, RunningAvgSamplesPerSec=85.11053605511405, CurrSamplesPerSec=84.67162096348876, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:26:29,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=11320, skipped=218, lr=[1.368364116449983e-06, 1.368364116449983e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:29,780] [INFO] [timer.py:215:stop] epoch=12/micro_step=280/global_step=11320, RunningAvgSamplesPerSec=85.11045144664509, CurrSamplesPerSec=84.57632856147232, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:30,478] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:26:31,173] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:26:37,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=11330, skipped=220, lr=[1.362621499114214e-06, 1.362621499114214e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:37,201] [INFO] [timer.py:215:stop] epoch=12/micro_step=290/global_step=11330, RunningAvgSamplesPerSec=85.11150313352334, CurrSamplesPerSec=84.89044001846851, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:44,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=11340, skipped=220, lr=[1.355457423281622e-06, 1.355457423281622e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:44,742] [INFO] [timer.py:215:stop] epoch=12/micro_step=300/global_step=11340, RunningAvgSamplesPerSec=85.11133893509262, CurrSamplesPerSec=85.01548413945385, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:52,237] [INFO] [logging.py:96:log_dist] [Rank 0] step=11350, skipped=220, lr=[1.3483091510546007e-06, 1.3483091510546007e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:52,271] [INFO] [timer.py:215:stop] epoch=12/micro_step=310/global_step=11350, RunningAvgSamplesPerSec=85.11130080941213, CurrSamplesPerSec=84.93438261745595, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:26:59,768] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=11360, skipped=220, lr=[1.3411767149931948e-06, 1.3411767149931948e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:26:59,802] [INFO] [timer.py:215:stop] epoch=12/micro_step=320/global_step=11360, RunningAvgSamplesPerSec=85.11124663739481, CurrSamplesPerSec=85.28843509403364, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:07,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=11370, skipped=220, lr=[1.334060147585321e-06, 1.334060147585321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:07,342] [INFO] [timer.py:215:stop] epoch=12/micro_step=330/global_step=11370, RunningAvgSamplesPerSec=85.11110762049245, CurrSamplesPerSec=85.01965772388984, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:14,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=11380, skipped=220, lr=[1.3269594812466154e-06, 1.3269594812466154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:14,867] [INFO] [timer.py:215:stop] epoch=12/micro_step=340/global_step=11380, RunningAvgSamplesPerSec=85.11111550221395, CurrSamplesPerSec=85.22842773685548, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:22,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=11390, skipped=220, lr=[1.3198747483202794e-06, 1.3198747483202794e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:22,403] [INFO] [timer.py:215:stop] epoch=12/micro_step=350/global_step=11390, RunningAvgSamplesPerSec=85.11100855804577, CurrSamplesPerSec=84.88788973621504, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:29,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=11400, skipped=220, lr=[1.312805981076949e-06, 1.312805981076949e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:29,932] [INFO] [timer.py:215:stop] epoch=12/micro_step=360/global_step=11400, RunningAvgSamplesPerSec=85.11097923780848, CurrSamplesPerSec=84.88013245116875, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:37,434] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=11410, skipped=220, lr=[1.3057532117145263e-06, 1.3057532117145263e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:37,467] [INFO] [timer.py:215:stop] epoch=12/micro_step=370/global_step=11410, RunningAvgSamplesPerSec=85.11088761467066, CurrSamplesPerSec=85.09889886932324, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:44,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=11420, skipped=220, lr=[1.2987164723580512e-06, 1.2987164723580512e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:45,015] [INFO] [timer.py:215:stop] epoch=12/micro_step=380/global_step=11420, RunningAvgSamplesPerSec=85.1106695184424, CurrSamplesPerSec=84.91984644318572, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:47,225] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:27:47,923] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:27:52,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=11430, skipped=222, lr=[1.2930986440185688e-06, 1.2930986440185688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:52,451] [INFO] [timer.py:215:stop] epoch=12/micro_step=390/global_step=11430, RunningAvgSamplesPerSec=85.11155710580253, CurrSamplesPerSec=84.66833604852332, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:27:59,966] [INFO] [logging.py:96:log_dist] [Rank 0] step=11440, skipped=222, lr=[1.286090839394733e-06, 1.286090839394733e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:27:59,999] [INFO] [timer.py:215:stop] epoch=12/micro_step=400/global_step=11440, RunningAvgSamplesPerSec=85.11132617043856, CurrSamplesPerSec=84.74240197192631, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:07,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=11450, skipped=222, lr=[1.279099154338038e-06, 1.279099154338038e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:07,535] [INFO] [timer.py:215:stop] epoch=12/micro_step=410/global_step=11450, RunningAvgSamplesPerSec=85.11122198068222, CurrSamplesPerSec=84.78115706572136, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:15,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=11460, skipped=222, lr=[1.272123620695286e-06, 1.272123620695286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:15,077] [INFO] [timer.py:215:stop] epoch=12/micro_step=420/global_step=11460, RunningAvgSamplesPerSec=85.1110660672889, CurrSamplesPerSec=85.09045562709944, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:22,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=11470, skipped=222, lr=[1.2651642702397094e-06, 1.2651642702397094e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:22,604] [INFO] [timer.py:215:stop] epoch=12/micro_step=430/global_step=11470, RunningAvgSamplesPerSec=85.111043386919, CurrSamplesPerSec=85.11487277895061, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:28:30,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=11480, skipped=222, lr=[1.2582211346708254e-06, 1.2582211346708254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:30,141] [INFO] [timer.py:215:stop] epoch=12/micro_step=440/global_step=11480, RunningAvgSamplesPerSec=85.11093004426068, CurrSamplesPerSec=85.1252644353728, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:37,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=11490, skipped=222, lr=[1.2512942456142958e-06, 1.2512942456142958e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:37,679] [INFO] [timer.py:215:stop] epoch=12/micro_step=450/global_step=11490, RunningAvgSamplesPerSec=85.1108110407918, CurrSamplesPerSec=85.01225325871569, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:45,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=11500, skipped=222, lr=[1.2443836346217802e-06, 1.2443836346217802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:45,218] [INFO] [timer.py:215:stop] epoch=12/micro_step=460/global_step=11500, RunningAvgSamplesPerSec=85.11067537938762, CurrSamplesPerSec=85.23105264544914, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:28:52,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=11510, skipped=222, lr=[1.237489333170787e-06, 1.237489333170787e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:28:52,760] [INFO] [timer.py:215:stop] epoch=12/micro_step=470/global_step=11510, RunningAvgSamplesPerSec=85.11051921107754, CurrSamplesPerSec=85.0029389272488, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:00,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=11520, skipped=222, lr=[1.230611372664545e-06, 1.230611372664545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:00,297] [INFO] [timer.py:215:stop] epoch=12/micro_step=480/global_step=11520, RunningAvgSamplesPerSec=85.11040963304957, CurrSamplesPerSec=84.82766710486399, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:29:04,012] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:29:04,710] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:29:07,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=11530, skipped=224, lr=[1.2251207907952224e-06, 1.2251207907952224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:07,730] [INFO] [timer.py:215:stop] epoch=12/micro_step=490/global_step=11530, RunningAvgSamplesPerSec=85.11131290876186, CurrSamplesPerSec=84.96847212427308, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:15,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=11540, skipped=224, lr=[1.2182723228879699e-06, 1.2182723228879699e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:15,269] [INFO] [timer.py:215:stop] epoch=12/micro_step=500/global_step=11540, RunningAvgSamplesPerSec=85.11118103262257, CurrSamplesPerSec=84.91162670643025, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:22,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=11550, skipped=224, lr=[1.2114402834580596e-06, 1.2114402834580596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:22,797] [INFO] [timer.py:215:stop] epoch=12/micro_step=510/global_step=11550, RunningAvgSamplesPerSec=85.11116260202384, CurrSamplesPerSec=84.94704200873852, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:30,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=11560, skipped=224, lr=[1.2046247036251101e-06, 1.2046247036251101e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:30,322] [INFO] [timer.py:215:stop] epoch=12/micro_step=520/global_step=11560, RunningAvgSamplesPerSec=85.11117204352851, CurrSamplesPerSec=85.02148884509573, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:37,825] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=11570, skipped=224, lr=[1.1978256144337731e-06, 1.1978256144337731e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:37,858] [INFO] [timer.py:215:stop] epoch=12/micro_step=530/global_step=11570, RunningAvgSamplesPerSec=85.11105851935243, CurrSamplesPerSec=84.89486982028637, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:45,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=11580, skipped=224, lr=[1.1910430468535866e-06, 1.1910430468535866e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:45,394] [INFO] [timer.py:215:stop] epoch=12/micro_step=540/global_step=11580, RunningAvgSamplesPerSec=85.11096541325311, CurrSamplesPerSec=85.30163395878543, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:29:52,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=11590, skipped=224, lr=[1.1842770317788278e-06, 1.1842770317788278e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:29:52,935] [INFO] [timer.py:215:stop] epoch=12/micro_step=550/global_step=11590, RunningAvgSamplesPerSec=85.11080864887376, CurrSamplesPerSec=85.11058190002984, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:00,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=11600, skipped=224, lr=[1.1775276000283831e-06, 1.1775276000283831e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:00,480] [INFO] [timer.py:215:stop] epoch=12/micro_step=560/global_step=11600, RunningAvgSamplesPerSec=85.11062773178651, CurrSamplesPerSec=84.71394704752191, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:07,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=11610, skipped=224, lr=[1.170794782345601e-06, 1.170794782345601e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:08,028] [INFO] [timer.py:215:stop] epoch=12/micro_step=570/global_step=11610, RunningAvgSamplesPerSec=85.11040062606273, CurrSamplesPerSec=85.01928073813846, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:15,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=11620, 
skipped=224, lr=[1.1640786093981561e-06, 1.1640786093981561e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:15,574] [INFO] [timer.py:215:stop] epoch=12/micro_step=580/global_step=11620, RunningAvgSamplesPerSec=85.11021425768946, CurrSamplesPerSec=84.6512212258552, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:20,801] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:30:21,500] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:30:22,977] [INFO] [logging.py:96:log_dist] [Rank 0] step=11630, skipped=226, lr=[1.1587176758099184e-06, 1.1587176758099184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:23,010] [INFO] [timer.py:215:stop] epoch=12/micro_step=590/global_step=11630, RunningAvgSamplesPerSec=85.11108799343893, CurrSamplesPerSec=84.72654080776907, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:30,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=11640, skipped=226, lr=[1.1520315404265167e-06, 1.1520315404265167e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:30,563] [INFO] [timer.py:215:stop] epoch=12/micro_step=600/global_step=11640, RunningAvgSamplesPerSec=85.11082564489988, CurrSamplesPerSec=84.90759800487173, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:38,075] [INFO] [logging.py:96:log_dist] [Rank 0] step=11650, skipped=226, lr=[1.1453621352441482e-06, 1.1453621352441482e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:38,109] [INFO] [timer.py:215:stop] epoch=12/micro_step=610/global_step=11650, RunningAvgSamplesPerSec=85.11062383059084, CurrSamplesPerSec=84.82557627321098, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:45,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=11660, skipped=226, lr=[1.1387094906416413e-06, 1.1387094906416413e-06], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:45,658] [INFO] [timer.py:215:stop] epoch=12/micro_step=620/global_step=11660, RunningAvgSamplesPerSec=85.1103938945262, CurrSamplesPerSec=84.0898138521236, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:30:53,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=11670, skipped=226, lr=[1.132073636921488e-06, 1.132073636921488e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:30:53,216] [INFO] [timer.py:215:stop] epoch=12/micro_step=630/global_step=11670, RunningAvgSamplesPerSec=85.11007904392653, CurrSamplesPerSec=84.71143409128061, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:00,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=11680, skipped=226, lr=[1.1254546043096908e-06, 1.1254546043096908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:00,764] [INFO] [timer.py:215:stop] epoch=12/micro_step=640/global_step=11680, RunningAvgSamplesPerSec=85.10985759423004, CurrSamplesPerSec=84.819813616964, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:08,289] [INFO] [logging.py:96:log_dist] [Rank 0] step=11690, skipped=226, lr=[1.1188524229556384e-06, 1.1188524229556384e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:08,323] [INFO] [timer.py:215:stop] epoch=12/micro_step=650/global_step=11690, RunningAvgSamplesPerSec=85.10953879985193, CurrSamplesPerSec=84.88966149500548, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:15,842] [INFO] [logging.py:96:log_dist] [Rank 0] step=11700, skipped=226, lr=[1.1122671229319608e-06, 1.1122671229319608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:15,875] [INFO] [timer.py:215:stop] epoch=12/micro_step=660/global_step=11700, RunningAvgSamplesPerSec=85.10928467385219, CurrSamplesPerSec=84.69219085912097, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:23,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=11710, skipped=226, lr=[1.1056987342343921e-06, 1.1056987342343921e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 19:31:23,439] [INFO] [timer.py:215:stop] epoch=12/micro_step=670/global_step=11710, RunningAvgSamplesPerSec=85.10891007510209, CurrSamplesPerSec=84.96285138687333, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:30,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=11720, skipped=226, lr=[1.0991472867816399e-06, 1.0991472867816399e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:30,973] [INFO] [timer.py:215:stop] epoch=12/micro_step=680/global_step=11720, RunningAvgSamplesPerSec=85.10882381576452, CurrSamplesPerSec=85.14232844929165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:37,707] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:31:38,404] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:31:38,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=11730, skipped=228, lr=[1.0939183465718164e-06, 1.0939183465718164e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:38,406] [INFO] [timer.py:215:stop] epoch=12/micro_step=690/global_step=11730, RunningAvgSamplesPerSec=85.10972702362713, CurrSamplesPerSec=91.85055657370137, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:45,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=11740, skipped=228, lr=[1.0873974685084702e-06, 1.0873974685084702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:45,963] [INFO] [timer.py:215:stop] epoch=12/micro_step=700/global_step=11740, RunningAvgSamplesPerSec=85.10942640786367, CurrSamplesPerSec=84.7009294099643, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:31:53,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=11750, skipped=228, lr=[1.0808936150513568e-06, 1.0808936150513568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:31:53,513] [INFO] [timer.py:215:stop] 
epoch=12/micro_step=710/global_step=11750, RunningAvgSamplesPerSec=85.10918942118123, CurrSamplesPerSec=84.91154612865247, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:01,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=11760, skipped=228, lr=[1.0744068158252268e-06, 1.0744068158252268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:01,051] [INFO] [timer.py:215:stop] epoch=12/micro_step=720/global_step=11760, RunningAvgSamplesPerSec=85.1090642507195, CurrSamplesPerSec=84.99354590311762, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:08,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=11770, skipped=228, lr=[1.0679371003771527e-06, 1.0679371003771527e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:08,595] [INFO] [timer.py:215:stop] epoch=12/micro_step=730/global_step=11770, RunningAvgSamplesPerSec=85.10889051626455, CurrSamplesPerSec=84.98084574531401, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:16,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=11780, skipped=228, lr=[1.0614844981763842e-06, 1.0614844981763842e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:16,147] [INFO] [timer.py:215:stop] epoch=12/micro_step=740/global_step=11780, RunningAvgSamplesPerSec=85.10864308319832, CurrSamplesPerSec=84.72908140429688, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:23,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=11790, skipped=228, lr=[1.055049038614228e-06, 1.055049038614228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:23,694] [INFO] [timer.py:215:stop] epoch=12/micro_step=750/global_step=11790, RunningAvgSamplesPerSec=85.10843619016141, CurrSamplesPerSec=84.55570836291658, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:31,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=11800, skipped=228, lr=[1.0486307510039028e-06, 1.0486307510039028e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:31,240] [INFO] [timer.py:215:stop] 
epoch=12/micro_step=760/global_step=11800, RunningAvgSamplesPerSec=85.10824132704819, CurrSamplesPerSec=84.8410455056462, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:38,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=11810, skipped=228, lr=[1.0422296645804113e-06, 1.0422296645804113e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:38,789] [INFO] [timer.py:215:stop] epoch=12/micro_step=770/global_step=11810, RunningAvgSamplesPerSec=85.10801553329496, CurrSamplesPerSec=84.72060442881269, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:46,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=11820, skipped=228, lr=[1.0358458085004068e-06, 1.0358458085004068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:46,329] [INFO] [timer.py:215:stop] epoch=12/micro_step=780/global_step=11820, RunningAvgSamplesPerSec=85.10788419998991, CurrSamplesPerSec=85.39427634887798, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=11830, skipped=228, lr=[1.0294792118420538e-06, 1.0294792118420538e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:32:53,862] [INFO] [timer.py:215:stop] epoch=12/micro_step=790/global_step=11830, RunningAvgSamplesPerSec=85.10780782273248, CurrSamplesPerSec=84.90703401716009, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:32:54,557] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:32:55,253] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:33:01,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=11840, skipped=230, lr=[1.0243983807893084e-06, 1.0243983807893084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:01,277] [INFO] [timer.py:215:stop] epoch=12/micro_step=800/global_step=11840, RunningAvgSamplesPerSec=85.10887074549059, CurrSamplesPerSec=84.85866639016427, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:08,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=11850, skipped=230, lr=[1.0180629241158942e-06, 1.0180629241158942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:08,809] [INFO] [timer.py:215:stop] epoch=12/micro_step=810/global_step=11850, RunningAvgSamplesPerSec=85.10880941183437, CurrSamplesPerSec=84.87001521704283, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:16,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=11860, skipped=230, lr=[1.0117448078643448e-06, 1.0117448078643448e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:16,342] [INFO] [timer.py:215:stop] epoch=12/micro_step=820/global_step=11860, RunningAvgSamplesPerSec=85.10874321178989, CurrSamplesPerSec=85.11050094420862, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:23,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=11870, skipped=230, lr=[1.0054440608133936e-06, 1.0054440608133936e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:23,878] [INFO] [timer.py:215:stop] epoch=12/micro_step=830/global_step=11870, RunningAvgSamplesPerSec=85.10864775650556, CurrSamplesPerSec=85.08174438911833, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:31,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=11880, skipped=230, lr=[9.991607116626455e-07, 9.991607116626455e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:31,411] [INFO] [timer.py:215:stop] epoch=12/micro_step=840/global_step=11880, RunningAvgSamplesPerSec=85.10857982102337, CurrSamplesPerSec=85.08465692654208, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:38,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=11890, skipped=230, lr=[9.928947890324673e-07, 9.928947890324673e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:38,939] [INFO] [timer.py:215:stop] epoch=12/micro_step=850/global_step=11890, RunningAvgSamplesPerSec=85.10855592157841, CurrSamplesPerSec=85.1713962171, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:46,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=11900, skipped=230, lr=[9.866463214638467e-07, 9.866463214638467e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:46,474] [INFO] [timer.py:215:stop] epoch=12/micro_step=860/global_step=11900, RunningAvgSamplesPerSec=85.1084613320051, CurrSamplesPerSec=85.30824847862156, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:33:53,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=11910, skipped=230, lr=[9.804153374182609e-07, 9.804153374182609e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:33:54,011] [INFO] [timer.py:215:stop] epoch=12/micro_step=870/global_step=11910, RunningAvgSamplesPerSec=85.10835644830998, CurrSamplesPerSec=84.92288223852965, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:34:01,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=11920, skipped=230, lr=[9.742018652775551e-07, 9.742018652775551e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:01,547] [INFO] [timer.py:215:stop] epoch=12/micro_step=880/global_step=11920, RunningAvgSamplesPerSec=85.108267447108, CurrSamplesPerSec=85.1366846876334, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:34:09,038] [INFO] [logging.py:96:log_dist] [Rank 0] step=11930, skipped=230, lr=[9.680059333438066e-07, 9.680059333438066e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:09,071] [INFO] [timer.py:215:stop] epoch=12/micro_step=890/global_step=11930, RunningAvgSamplesPerSec=85.10827212470186, CurrSamplesPerSec=85.13541561939728, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:34:11,278] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:34:11,976] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:34:16,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=11940, skipped=232, lr=[9.630618357138125e-07, 9.630618357138125e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:16,488] [INFO] [timer.py:215:stop] epoch=12/micro_step=900/global_step=11940, RunningAvgSamplesPerSec=85.10930470619581, CurrSamplesPerSec=85.08622114931103, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:34:23,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=11950, skipped=232, lr=[9.568975472187164e-07, 9.568975472187164e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:24,012] [INFO] [timer.py:215:stop] epoch=12/micro_step=910/global_step=11950, RunningAvgSamplesPerSec=85.1093196693995, CurrSamplesPerSec=85.22247493507565, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:34:31,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=11960, skipped=232, lr=[9.507508777509438e-07, 9.507508777509438e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:31,550] [INFO] [timer.py:215:stop] epoch=12/micro_step=920/global_step=11960, RunningAvgSamplesPerSec=85.10920486866422, CurrSamplesPerSec=84.66104605028278, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 13/16 *****
ppl: 1.7827438116073608
Beginning of Epoch 14/16, Total Micro Batches 920
[2023-06-29 19:34:56,972] [INFO] [logging.py:96:log_dist] [Rank 0] step=11970, skipped=232, lr=[9.446218553082909e-07, 9.446218553082909e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:34:57,005] [INFO] [timer.py:215:stop] epoch=13/micro_step=10/global_step=11970, RunningAvgSamplesPerSec=85.10889909278292, 
CurrSamplesPerSec=84.68390827079624, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:04,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=11980, skipped=232, lr=[9.385105078081679e-07, 9.385105078081679e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:04,530] [INFO] [timer.py:215:stop] epoch=13/micro_step=20/global_step=11980, RunningAvgSamplesPerSec=85.10890632120066, CurrSamplesPerSec=85.3257114013342, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:12,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=11990, skipped=232, lr=[9.32416863087481e-07, 9.32416863087481e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:12,064] [INFO] [timer.py:215:stop] epoch=13/micro_step=30/global_step=11990, RunningAvgSamplesPerSec=85.10882949438127, CurrSamplesPerSec=84.84198402880453, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:19,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=12000, skipped=232, lr=[9.26340948902499e-07, 9.26340948902499e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:19,594] [INFO] [timer.py:215:stop] epoch=13/micro_step=40/global_step=12000, RunningAvgSamplesPerSec=85.10878381471846, CurrSamplesPerSec=85.01421868190893, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:27,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=12010, skipped=232, lr=[9.202827929287289e-07, 9.202827929287289e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:27,137] [INFO] [timer.py:215:stop] epoch=13/micro_step=50/global_step=12010, RunningAvgSamplesPerSec=85.10861914679961, CurrSamplesPerSec=85.26898302532764, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:34,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=12020, skipped=232, lr=[9.142424227607926e-07, 9.142424227607926e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:34,677] [INFO] [timer.py:215:stop] epoch=13/micro_step=60/global_step=12020, RunningAvgSamplesPerSec=85.10848836113473, CurrSamplesPerSec=84.50739420734884, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:42,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=12030, skipped=232, lr=[9.082198659122924e-07, 9.082198659122924e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:42,213] [INFO] [timer.py:215:stop] epoch=13/micro_step=70/global_step=12030, RunningAvgSamplesPerSec=85.10838396063072, CurrSamplesPerSec=85.0842254269339, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:45,920] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:35:46,617] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:35:49,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=12040, skipped=234, lr=[9.034146644608342e-07, 9.034146644608342e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:49,629] [INFO] [timer.py:215:stop] epoch=13/micro_step=80/global_step=12040, RunningAvgSamplesPerSec=85.10941702439601, CurrSamplesPerSec=84.86623193748297, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:35:57,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=12050, skipped=234, lr=[8.974242406625366e-07, 8.974242406625366e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:35:57,172] [INFO] [timer.py:215:stop] epoch=13/micro_step=90/global_step=12050, RunningAvgSamplesPerSec=85.10925403929697, CurrSamplesPerSec=84.95685496071106, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:04,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=12060, skipped=234, lr=[8.914517067897149e-07, 8.914517067897149e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:04,716] [INFO] [timer.py:215:stop] epoch=13/micro_step=100/global_step=12060, RunningAvgSamplesPerSec=85.10909405536923, CurrSamplesPerSec=84.86003452752892, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:12,220] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=12070, skipped=234, lr=[8.854970900469853e-07, 8.854970900469853e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:12,253] [INFO] [timer.py:215:stop] epoch=13/micro_step=110/global_step=12070, RunningAvgSamplesPerSec=85.10898742025853, CurrSamplesPerSec=85.2544404106028, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:19,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=12080, skipped=234, lr=[8.79560417557352e-07, 8.79560417557352e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:19,788] [INFO] [timer.py:215:stop] epoch=13/micro_step=120/global_step=12080, RunningAvgSamplesPerSec=85.10890279647249, CurrSamplesPerSec=85.00404254200716, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:27,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=12090, skipped=234, lr=[8.73641716362083e-07, 8.73641716362083e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:27,317] [INFO] [timer.py:215:stop] epoch=13/micro_step=130/global_step=12090, RunningAvgSamplesPerSec=85.10887512746443, CurrSamplesPerSec=85.25276169442573, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:34,816] [INFO] [logging.py:96:log_dist] [Rank 0] step=12100, skipped=234, lr=[8.677410134205861e-07, 8.677410134205861e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:34,850] [INFO] [timer.py:215:stop] epoch=13/micro_step=140/global_step=12100, RunningAvgSamplesPerSec=85.10881951631485, CurrSamplesPerSec=85.03541333188036, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:42,351] [INFO] [logging.py:96:log_dist] [Rank 0] step=12110, skipped=234, lr=[8.618583356102906e-07, 8.618583356102906e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:42,385] [INFO] [timer.py:215:stop] epoch=13/micro_step=150/global_step=12110, RunningAvgSamplesPerSec=85.10873365442818, CurrSamplesPerSec=84.7071570617946, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:49,885] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=12120, skipped=234, lr=[8.559937097265223e-07, 8.559937097265223e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:49,919] [INFO] [timer.py:215:stop] epoch=13/micro_step=160/global_step=12120, RunningAvgSamplesPerSec=85.1086561400948, CurrSamplesPerSec=84.97987724498957, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:36:57,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=12130, skipped=234, lr=[8.501471624823767e-07, 8.501471624823767e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:36:57,448] [INFO] [timer.py:215:stop] epoch=13/micro_step=170/global_step=12130, RunningAvgSamplesPerSec=85.10862446507309, CurrSamplesPerSec=85.10086830097508, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:02,671] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:37:03,368] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:37:04,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=12140, skipped=236, lr=[8.454829592062268e-07, 8.454829592062268e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:04,876] [INFO] [timer.py:215:stop] epoch=13/micro_step=180/global_step=12140, RunningAvgSamplesPerSec=85.10953630922387, CurrSamplesPerSec=85.14308460816993, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:12,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=12150, skipped=236, lr=[8.396690205674879e-07, 8.396690205674879e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:12,411] [INFO] [timer.py:215:stop] epoch=13/micro_step=190/global_step=12150, RunningAvgSamplesPerSec=85.10944923004148, CurrSamplesPerSec=85.32666067806028, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:19,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=12160, skipped=236, lr=[8.33873234926574e-07, 8.33873234926574e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:19,950] [INFO] [timer.py:215:stop] epoch=13/micro_step=200/global_step=12160, RunningAvgSamplesPerSec=85.10932810309534, CurrSamplesPerSec=84.66641328973569, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:27,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=12170, skipped=236, lr=[8.280956286830244e-07, 8.280956286830244e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:27,491] [INFO] [timer.py:215:stop] epoch=13/micro_step=210/global_step=12170, RunningAvgSamplesPerSec=85.1091863362839, CurrSamplesPerSec=84.76426426663492, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:34,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=12180, skipped=236, lr=[8.223362281535643e-07, 8.223362281535643e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:35,026] [INFO] [timer.py:215:stop] epoch=13/micro_step=220/global_step=12180, RunningAvgSamplesPerSec=85.10910210184196, CurrSamplesPerSec=85.0929641471191, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:37:42,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=12190, skipped=236, lr=[8.165950595719979e-07, 8.165950595719979e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:42,551] [INFO] [timer.py:215:stop] epoch=13/micro_step=230/global_step=12190, RunningAvgSamplesPerSec=85.10910602148554, CurrSamplesPerSec=85.15278083541482, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:50,061] [INFO] [logging.py:96:log_dist] [Rank 0] step=12200, skipped=236, lr=[8.108721490890804e-07, 8.108721490890804e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:50,094] [INFO] [timer.py:215:stop] epoch=13/micro_step=240/global_step=12200, RunningAvgSamplesPerSec=85.10894515075888, CurrSamplesPerSec=84.63309934090117, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:37:57,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=12210, skipped=236, lr=[8.051675227724063e-07, 8.051675227724063e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:37:57,632] [INFO] [timer.py:215:stop] epoch=13/micro_step=250/global_step=12210, RunningAvgSamplesPerSec=85.10883610987028, CurrSamplesPerSec=85.34082096945049, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:05,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=12220, skipped=236, lr=[7.994812066062806e-07, 7.994812066062806e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:05,165] [INFO] [timer.py:215:stop] epoch=13/micro_step=260/global_step=12220, RunningAvgSamplesPerSec=85.10876678532016, CurrSamplesPerSec=85.10936757888223, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:12,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=12230, skipped=236, lr=[7.938132264916119e-07, 7.938132264916119e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:12,689] [INFO] [timer.py:215:stop] epoch=13/micro_step=270/global_step=12230, RunningAvgSamplesPerSec=85.10877661561318, CurrSamplesPerSec=85.02684802064701, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:38:19,401] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:38:20,099] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:38:20,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=12240, skipped=238, lr=[7.892920617090187e-07, 7.892920617090187e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:20,100] [INFO] [timer.py:215:stop] epoch=13/micro_step=280/global_step=12240, RunningAvgSamplesPerSec=85.10984624496633, CurrSamplesPerSec=91.86476445481152, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:27,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=12250, skipped=238, lr=[7.836571514905854e-07, 7.836571514905854e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:27,626] [INFO] [timer.py:215:stop] epoch=13/micro_step=290/global_step=12250, RunningAvgSamplesPerSec=85.10983799092105, CurrSamplesPerSec=85.29206640718087, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:35,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=12260, skipped=238, lr=[7.780406494014457e-07, 7.780406494014457e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:35,162] [INFO] [timer.py:215:stop] epoch=13/micro_step=300/global_step=12260, RunningAvgSamplesPerSec=85.1097482412012, CurrSamplesPerSec=85.21947179160432, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:42,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=12270, skipped=238, lr=[7.72442581024507e-07, 7.72442581024507e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:42,686] [INFO] [timer.py:215:stop] epoch=13/micro_step=310/global_step=12270, RunningAvgSamplesPerSec=85.10976552116654, CurrSamplesPerSec=84.89631967402096, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:50,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=12280, 
skipped=238, lr=[7.668629718587102e-07, 7.668629718587102e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:50,217] [INFO] [timer.py:215:stop] epoch=13/micro_step=320/global_step=12280, RunningAvgSamplesPerSec=85.1097163220569, CurrSamplesPerSec=84.98302495176014, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:38:57,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=12290, skipped=238, lr=[7.61301847318918e-07, 7.61301847318918e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:38:57,754] [INFO] [timer.py:215:stop] epoch=13/micro_step=330/global_step=12290, RunningAvgSamplesPerSec=85.10960794126994, CurrSamplesPerSec=85.18591051202537, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:05,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=12300, skipped=238, lr=[7.557592327357927e-07, 7.557592327357927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:05,287] [INFO] [timer.py:215:stop] epoch=13/micro_step=340/global_step=12300, RunningAvgSamplesPerSec=85.10954990367514, CurrSamplesPerSec=85.24025463273892, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:12,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=12310, skipped=238, lr=[7.502351533556875e-07, 7.502351533556875e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:12,816] [INFO] [timer.py:215:stop] epoch=13/micro_step=350/global_step=12310, RunningAvgSamplesPerSec=85.1095195765093, CurrSamplesPerSec=85.1696937468827, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:20,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=12320, skipped=238, lr=[7.447296343405291e-07, 7.447296343405291e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:20,337] [INFO] [timer.py:215:stop] epoch=13/micro_step=360/global_step=12320, RunningAvgSamplesPerSec=85.10956196739365, CurrSamplesPerSec=84.96984380465214, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:27,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=12330, skipped=238, 
lr=[7.392427007676979e-07, 7.392427007676979e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:27,862] [INFO] [timer.py:215:stop] epoch=13/micro_step=370/global_step=12330, RunningAvgSamplesPerSec=85.10956137412342, CurrSamplesPerSec=84.98848691084227, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:35,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=12340, skipped=238, lr=[7.337743776299272e-07, 7.337743776299272e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:35,399] [INFO] [timer.py:215:stop] epoch=13/micro_step=380/global_step=12340, RunningAvgSamplesPerSec=85.10945727661974, CurrSamplesPerSec=85.05925984136198, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:36,095] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:39:36,790] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:39:42,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=12350, skipped=240, lr=[7.294131353739503e-07, 7.294131353739503e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:42,826] [INFO] [timer.py:215:stop] epoch=13/micro_step=390/global_step=12350, RunningAvgSamplesPerSec=85.11036728352144, CurrSamplesPerSec=84.94271427640201, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:50,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=12360, skipped=240, lr=[7.239783737302979e-07, 7.239783737302979e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:39:50,369] [INFO] [timer.py:215:stop] epoch=13/micro_step=400/global_step=12360, RunningAvgSamplesPerSec=85.11021188772041, CurrSamplesPerSec=84.55674712461558, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:39:57,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=12370, skipped=240, lr=[7.185622920500073e-07, 7.185622920500073e-07], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 19:39:57,899] [INFO] [timer.py:215:stop] epoch=13/micro_step=410/global_step=12370, RunningAvgSamplesPerSec=85.11016408813164, CurrSamplesPerSec=85.11125653786253, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:05,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=12380, skipped=240, lr=[7.131649150030759e-07, 7.131649150030759e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:05,431] [INFO] [timer.py:215:stop] epoch=13/micro_step=420/global_step=12380, RunningAvgSamplesPerSec=85.11010688467186, CurrSamplesPerSec=84.99943985041568, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:12,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=12390, skipped=240, lr=[7.077862671743073e-07, 7.077862671743073e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:12,953] [INFO] [timer.py:215:stop] epoch=13/micro_step=430/global_step=12390, RunningAvgSamplesPerSec=85.11014804624271, CurrSamplesPerSec=85.23600523287567, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:20,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=12400, skipped=240, lr=[7.024263730631927e-07, 7.024263730631927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:20,483] [INFO] [timer.py:215:stop] epoch=13/micro_step=440/global_step=12400, RunningAvgSamplesPerSec=85.11010314328816, CurrSamplesPerSec=85.21208658232928, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:27,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=12410, skipped=240, lr=[6.970852570838024e-07, 6.970852570838024e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:28,011] [INFO] [timer.py:215:stop] epoch=13/micro_step=450/global_step=12410, RunningAvgSamplesPerSec=85.11008372969962, CurrSamplesPerSec=84.98942869888683, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:35,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=12420, skipped=240, lr=[6.917629435646699e-07, 6.917629435646699e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
19:40:35,542] [INFO] [timer.py:215:stop] epoch=13/micro_step=460/global_step=12420, RunningAvgSamplesPerSec=85.11004194973619, CurrSamplesPerSec=85.20545992015741, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:43,038] [INFO] [logging.py:96:log_dist] [Rank 0] step=12430, skipped=240, lr=[6.864594567486877e-07, 6.864594567486877e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:43,071] [INFO] [timer.py:215:stop] epoch=13/micro_step=470/global_step=12430, RunningAvgSamplesPerSec=85.11001004264595, CurrSamplesPerSec=85.29521018170819, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:50,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=12440, skipped=240, lr=[6.811748207929931e-07, 6.811748207929931e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:50,605] [INFO] [timer.py:215:stop] epoch=13/micro_step=480/global_step=12440, RunningAvgSamplesPerSec=85.10993736674054, CurrSamplesPerSec=84.87720706133149, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:40:52,812] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:40:53,511] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:40:57,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=12450, skipped=242, lr=[6.76960700826625e-07, 6.76960700826625e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:40:58,031] [INFO] [timer.py:215:stop] epoch=13/micro_step=490/global_step=12450, RunningAvgSamplesPerSec=85.11084449156277, CurrSamplesPerSec=85.095283985433, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:05,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=12460, skipped=242, lr=[6.717100570212791e-07, 6.717100570212791e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:05,560] [INFO] [timer.py:215:stop] epoch=13/micro_step=500/global_step=12460, RunningAvgSamplesPerSec=85.11081187137424, CurrSamplesPerSec=84.90330114455661, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=12470, skipped=242, lr=[6.664783312590531e-07, 6.664783312590531e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:13,090] [INFO] [timer.py:215:stop] epoch=13/micro_step=510/global_step=12470, RunningAvgSamplesPerSec=85.11078034892992, CurrSamplesPerSec=84.87039088023063, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:20,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=12480, skipped=242, lr=[6.612655473702116e-07, 6.612655473702116e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:20,625] [INFO] [timer.py:215:stop] epoch=13/micro_step=520/global_step=12480, RunningAvgSamplesPerSec=85.11069662339034, CurrSamplesPerSec=85.26984978097691, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:28,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=12490, skipped=242, lr=[6.560717290987487e-07, 6.560717290987487e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:28,154] [INFO] [timer.py:215:stop] epoch=13/micro_step=530/global_step=12490, RunningAvgSamplesPerSec=85.11066471005947, CurrSamplesPerSec=85.22047281591063, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:41:35,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=12500, skipped=242, lr=[6.508969001022612e-07, 6.508969001022612e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:35,684] [INFO] [timer.py:215:stop] epoch=13/micro_step=540/global_step=12500, RunningAvgSamplesPerSec=85.1106190571926, CurrSamplesPerSec=85.39435784544517, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:43,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=12510, skipped=242, lr=[6.457410839518574e-07, 6.457410839518574e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:43,226] [INFO] [timer.py:215:stop] epoch=13/micro_step=550/global_step=12510, RunningAvgSamplesPerSec=85.11047766328983, CurrSamplesPerSec=85.24447738604583, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:50,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=12520, skipped=242, lr=[6.406043041320413e-07, 6.406043041320413e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:50,757] [INFO] [timer.py:215:stop] epoch=13/micro_step=560/global_step=12520, RunningAvgSamplesPerSec=85.110431383634, CurrSamplesPerSec=84.99903612931826, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:41:58,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=12530, skipped=242, lr=[6.354865840406055e-07, 6.354865840406055e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:41:58,286] [INFO] [timer.py:215:stop] epoch=13/micro_step=570/global_step=12530, RunningAvgSamplesPerSec=85.11040251967962, CurrSamplesPerSec=84.65957752124646, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:05,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=12540, skipped=242, lr=[6.303879469885276e-07, 6.303879469885276e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:05,809] [INFO] [timer.py:215:stop] epoch=13/micro_step=580/global_step=12540, RunningAvgSamplesPerSec=85.1104254036602, CurrSamplesPerSec=84.90652374901985, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:42:09,514] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:42:10,207] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:42:13,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=12550, skipped=244, lr=[6.263227927446931e-07, 6.263227927446931e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:13,219] [INFO] [timer.py:215:stop] epoch=13/micro_step=590/global_step=12550, RunningAvgSamplesPerSec=85.11147629747802, CurrSamplesPerSec=84.93177595020694, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:20,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=12560, skipped=244, lr=[6.212585636296005e-07, 6.212585636296005e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:20,759] [INFO] [timer.py:215:stop] epoch=13/micro_step=600/global_step=12560, RunningAvgSamplesPerSec=85.11134993505546, CurrSamplesPerSec=85.12564236160111, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:28,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=12570, skipped=244, lr=[6.162134823618406e-07, 6.162134823618406e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:28,289] [INFO] [timer.py:215:stop] epoch=13/micro_step=610/global_step=12570, RunningAvgSamplesPerSec=85.11131367644217, CurrSamplesPerSec=85.01516104032905, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:35,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=12580, skipped=244, lr=[6.11187571921523e-07, 6.11187571921523e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:35,811] [INFO] [timer.py:215:stop] epoch=13/micro_step=620/global_step=12580, RunningAvgSamplesPerSec=85.11134831045962, CurrSamplesPerSec=85.09150756910077, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:43,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=12590, 
skipped=244, lr=[6.061808552014389e-07, 6.061808552014389e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:43,344] [INFO] [timer.py:215:stop] epoch=13/micro_step=630/global_step=12590, RunningAvgSamplesPerSec=85.1112756216723, CurrSamplesPerSec=84.94161225351912, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:50,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=12600, skipped=244, lr=[6.011933550069471e-07, 6.011933550069471e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:50,889] [INFO] [timer.py:215:stop] epoch=13/micro_step=640/global_step=12600, RunningAvgSamplesPerSec=85.11111427503579, CurrSamplesPerSec=85.08970040272796, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:42:58,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=12610, skipped=244, lr=[5.962250940558841e-07, 5.962250940558841e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:42:58,427] [INFO] [timer.py:215:stop] epoch=13/micro_step=650/global_step=12610, RunningAvgSamplesPerSec=85.11100635556477, CurrSamplesPerSec=84.81046101856421, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:05,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=12620, skipped=244, lr=[5.912760949784454e-07, 5.912760949784454e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:05,961] [INFO] [timer.py:215:stop] epoch=13/micro_step=660/global_step=12620, RunningAvgSamplesPerSec=85.11092658312806, CurrSamplesPerSec=84.94080591342815, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:13,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=12630, skipped=244, lr=[5.863463803170926e-07, 5.863463803170926e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:13,492] [INFO] [timer.py:215:stop] epoch=13/micro_step=670/global_step=12630, RunningAvgSamplesPerSec=85.11087829273465, CurrSamplesPerSec=85.09933051777561, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:20,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=12640, skipped=244, 
lr=[5.814359725264495e-07, 5.814359725264495e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:21,020] [INFO] [timer.py:215:stop] epoch=13/micro_step=680/global_step=12640, RunningAvgSamplesPerSec=85.11085759809835, CurrSamplesPerSec=85.08983526323965, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:26,232] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:43:26,928] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:43:28,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=12650, skipped=246, lr=[5.775215622742102e-07, 5.775215622742102e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:28,433] [INFO] [timer.py:215:stop] epoch=13/micro_step=690/global_step=12650, RunningAvgSamplesPerSec=85.11187308767924, CurrSamplesPerSec=85.34944965778882, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:35,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=12660, skipped=246, lr=[5.726459631557105e-07, 5.726459631557105e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:35,968] [INFO] [timer.py:215:stop] epoch=13/micro_step=700/global_step=12660, RunningAvgSamplesPerSec=85.11178667276964, CurrSamplesPerSec=85.03328531044454, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:43,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=12670, skipped=246, lr=[5.677897333126853e-07, 5.677897333126853e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:43,499] [INFO] [timer.py:215:stop] epoch=13/micro_step=710/global_step=12670, RunningAvgSamplesPerSec=85.11173486618694, CurrSamplesPerSec=85.06791255894785, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:50,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=12680, skipped=246, lr=[5.629528948650338e-07, 5.629528948650338e-07], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 19:43:51,025] [INFO] [timer.py:215:stop] epoch=13/micro_step=720/global_step=12680, RunningAvgSamplesPerSec=85.11173062766613, CurrSamplesPerSec=84.87353048119212, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:43:58,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=12690, skipped=246, lr=[5.581354698443326e-07, 5.581354698443326e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:43:58,547] [INFO] [timer.py:215:stop] epoch=13/micro_step=730/global_step=12690, RunningAvgSamplesPerSec=85.11176475436908, CurrSamplesPerSec=85.05427387217372, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:06,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=12700, skipped=246, lr=[5.533374801937277e-07, 5.533374801937277e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:06,077] [INFO] [timer.py:215:stop] epoch=13/micro_step=740/global_step=12700, RunningAvgSamplesPerSec=85.11172230879262, CurrSamplesPerSec=85.03204625841893, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:13,567] [INFO] [logging.py:96:log_dist] [Rank 0] step=12710, skipped=246, lr=[5.485589477678411e-07, 5.485589477678411e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:13,600] [INFO] [timer.py:215:stop] epoch=13/micro_step=750/global_step=12710, RunningAvgSamplesPerSec=85.11174133874616, CurrSamplesPerSec=85.1908848907073, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:21,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=12720, skipped=246, lr=[5.437998943326664e-07, 5.437998943326664e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:21,126] [INFO] [timer.py:215:stop] epoch=13/micro_step=760/global_step=12720, RunningAvgSamplesPerSec=85.11174177023027, CurrSamplesPerSec=85.15032280842217, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:28,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=12730, skipped=246, lr=[5.390603415654701e-07, 5.390603415654701e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
19:44:28,670] [INFO] [timer.py:215:stop] epoch=13/micro_step=770/global_step=12730, RunningAvgSamplesPerSec=85.11158657577342, CurrSamplesPerSec=85.02512439300111, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:36,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=12740, skipped=246, lr=[5.343403110546955e-07, 5.343403110546955e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:36,205] [INFO] [timer.py:215:stop] epoch=13/micro_step=780/global_step=12740, RunningAvgSamplesPerSec=85.1114983826117, CurrSamplesPerSec=84.9504561211631, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:42,924] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:44:43,618] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:44:43,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=12750, skipped=248, lr=[5.305783571213679e-07, 5.305783571213679e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:43,620] [INFO] [timer.py:215:stop] epoch=13/micro_step=790/global_step=12750, RunningAvgSamplesPerSec=85.11248584778744, CurrSamplesPerSec=92.1872858104936, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:51,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=12760, skipped=248, lr=[5.258935207911255e-07, 5.258935207911255e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:51,140] [INFO] [timer.py:215:stop] epoch=13/micro_step=800/global_step=12760, RunningAvgSamplesPerSec=85.11253021852198, CurrSamplesPerSec=84.83951709808795, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:44:58,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=12770, skipped=248, lr=[5.212282666915628e-07, 5.212282666915628e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:44:58,670] [INFO] [timer.py:215:stop] 
epoch=13/micro_step=810/global_step=12770, RunningAvgSamplesPerSec=85.11249025450921, CurrSamplesPerSec=84.82957038739303, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:06,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=12780, skipped=248, lr=[5.165826160726964e-07, 5.165826160726964e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:06,191] [INFO] [timer.py:215:stop] epoch=13/micro_step=820/global_step=12780, RunningAvgSamplesPerSec=85.1125287465921, CurrSamplesPerSec=85.28935644272165, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:13,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=12790, skipped=248, lr=[5.119565900952507e-07, 5.119565900952507e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:13,718] [INFO] [timer.py:215:stop] epoch=13/micro_step=830/global_step=12790, RunningAvgSamplesPerSec=85.11251082078735, CurrSamplesPerSec=85.36095729017939, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:21,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=12800, skipped=248, lr=[5.073502098305568e-07, 5.073502098305568e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:21,256] [INFO] [timer.py:215:stop] epoch=13/micro_step=840/global_step=12800, RunningAvgSamplesPerSec=85.11240027321139, CurrSamplesPerSec=85.33333333333333, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:28,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=12810, skipped=248, lr=[5.027634962604662e-07, 5.027634962604662e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:28,806] [INFO] [timer.py:215:stop] epoch=13/micro_step=850/global_step=12810, RunningAvgSamplesPerSec=85.11218543811137, CurrSamplesPerSec=85.06009538527546, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:36,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=12820, skipped=248, lr=[4.981964702772441e-07, 4.981964702772441e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:36,339] [INFO] [timer.py:215:stop] 
epoch=13/micro_step=860/global_step=12820, RunningAvgSamplesPerSec=85.11211413598794, CurrSamplesPerSec=85.07047366966593, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:43,842] [INFO] [logging.py:96:log_dist] [Rank 0] step=12830, skipped=248, lr=[4.936491526834829e-07, 4.936491526834829e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:43,876] [INFO] [timer.py:215:stop] epoch=13/micro_step=870/global_step=12830, RunningAvgSamplesPerSec=85.11201562234145, CurrSamplesPerSec=84.78214781918521, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:51,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=12840, skipped=248, lr=[4.891215641920053e-07, 4.891215641920053e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:51,407] [INFO] [timer.py:215:stop] epoch=13/micro_step=880/global_step=12840, RunningAvgSamplesPerSec=85.11196639158688, CurrSamplesPerSec=85.03541333188036, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:58,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=12850, skipped=248, lr=[4.846137254257634e-07, 4.846137254257634e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:45:58,937] [INFO] [timer.py:215:stop] epoch=13/micro_step=890/global_step=12850, RunningAvgSamplesPerSec=85.1119300374625, CurrSamplesPerSec=84.87138372037877, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:45:59,633] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:46:00,330] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:46:06,314] [INFO] [logging.py:96:log_dist] [Rank 0] step=12860, skipped=250, lr=[4.810216880161397e-07, 4.810216880161397e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:46:06,348] [INFO] [timer.py:215:stop] epoch=13/micro_step=900/global_step=12860, RunningAvgSamplesPerSec=85.11293701076637, CurrSamplesPerSec=85.16301963915826, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:46:13,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=12870, skipped=250, lr=[4.7654945043796583e-07, 4.7654945043796583e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:46:13,864] [INFO] [timer.py:215:stop] epoch=13/micro_step=910/global_step=12870, RunningAvgSamplesPerSec=85.11302300511541, CurrSamplesPerSec=85.5046291901768, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:46:21,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=12880, skipped=250, lr=[4.720970198504239e-07, 4.720970198504239e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:46:21,390] [INFO] [timer.py:215:stop] epoch=13/micro_step=920/global_step=12880, RunningAvgSamplesPerSec=85.11301646779506, CurrSamplesPerSec=84.92435991656745, MemAllocated=5.0GB, MaxMemAllocated=23.99GB ***** Evaluating perplexity, Epoch 14/16 ***** ppl: 1.7731566429138184 Beginning of Epoch 15/16, Total Micro Batches 920 [2023-06-29 19:46:46,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=12890, skipped=250, lr=[4.6766441653412963e-07, 4.6766441653412963e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:46:46,835] [INFO] [timer.py:215:stop] epoch=14/micro_step=10/global_step=12890, RunningAvgSamplesPerSec=85.11282932098771, CurrSamplesPerSec=84.6742383952629, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:46:54,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=12900, skipped=250, lr=[4.632516606793859e-07, 4.632516606793859e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:46:54,379] [INFO] [timer.py:215:stop] 
epoch=14/micro_step=20/global_step=12900, RunningAvgSamplesPerSec=85.11267174498374, CurrSamplesPerSec=84.66595931776577, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:01,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=12910, skipped=250, lr=[4.588587723860912e-07, 4.588587723860912e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:01,936] [INFO] [timer.py:215:stop] epoch=14/micro_step=30/global_step=12910, RunningAvgSamplesPerSec=85.11239122672046, CurrSamplesPerSec=83.75847919185615, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:09,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=12920, skipped=250, lr=[4.544857716636472e-07, 4.544857716636472e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:09,471] [INFO] [timer.py:215:stop] epoch=14/micro_step=40/global_step=12920, RunningAvgSamplesPerSec=85.11230783481318, CurrSamplesPerSec=85.1348756018358, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:16,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=12930, skipped=250, lr=[4.5013267843086914e-07, 4.5013267843086914e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:17,005] [INFO] [timer.py:215:stop] epoch=14/micro_step=50/global_step=12930, RunningAvgSamplesPerSec=85.11222948589757, CurrSamplesPerSec=85.20053792399047, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:24,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=12940, skipped=250, lr=[4.457995125158961e-07, 4.457995125158961e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:24,543] [INFO] [timer.py:215:stop] epoch=14/micro_step=60/global_step=12940, RunningAvgSamplesPerSec=85.11211677136336, CurrSamplesPerSec=84.77349959039975, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:32,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=12950, skipped=250, lr=[4.4148629365609697e-07, 4.4148629365609697e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:32,082] [INFO] [timer.py:215:stop] 
epoch=14/micro_step=70/global_step=12950, RunningAvgSamplesPerSec=85.11200436461908, CurrSamplesPerSec=85.02062713117355, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:34,285] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:47:34,983] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:47:39,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=12960, skipped=252, lr=[4.3805009365349497e-07, 4.3805009365349497e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:39,499] [INFO] [timer.py:215:stop] epoch=14/micro_step=80/global_step=12960, RunningAvgSamplesPerSec=85.11295545845998, CurrSamplesPerSec=85.27469852037443, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:47,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=12970, skipped=252, lr=[4.337728289411066e-07, 4.337728289411066e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:47,035] [INFO] [timer.py:215:stop] epoch=14/micro_step=90/global_step=12970, RunningAvgSamplesPerSec=85.11286521351984, CurrSamplesPerSec=84.85593024777567, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:47:54,547] [INFO] [logging.py:96:log_dist] [Rank 0] step=12980, skipped=252, lr=[4.2951556606487733e-07, 4.2951556606487733e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:47:54,581] [INFO] [timer.py:215:stop] epoch=14/micro_step=100/global_step=12980, RunningAvgSamplesPerSec=85.11268349501265, CurrSamplesPerSec=84.53714812989413, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:02,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=12990, skipped=252, lr=[4.2527832441644477e-07, 4.2527832441644477e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:02,118] [INFO] [timer.py:215:stop] epoch=14/micro_step=110/global_step=12990, 
RunningAvgSamplesPerSec=85.11257925797874, CurrSamplesPerSec=85.25942279626038, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:09,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=13000, skipped=252, lr=[4.210611232962461e-07, 4.210611232962461e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:09,657] [INFO] [timer.py:215:stop] epoch=14/micro_step=120/global_step=13000, RunningAvgSamplesPerSec=85.11247125548724, CurrSamplesPerSec=85.02964906462319, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:17,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=13010, skipped=252, lr=[4.1686398191343745e-07, 4.1686398191343745e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:17,181] [INFO] [timer.py:215:stop] epoch=14/micro_step=130/global_step=13010, RunningAvgSamplesPerSec=85.11248315725015, CurrSamplesPerSec=85.14519112158263, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:24,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=13020, skipped=252, lr=[4.1268691938580383e-07, 4.1268691938580383e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:24,719] [INFO] [timer.py:215:stop] epoch=14/micro_step=140/global_step=13020, RunningAvgSamplesPerSec=85.1123711098408, CurrSamplesPerSec=84.77459725542175, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:32,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=13030, skipped=252, lr=[4.085299547396713e-07, 4.085299547396713e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:32,240] [INFO] [timer.py:215:stop] epoch=14/micro_step=150/global_step=13030, RunningAvgSamplesPerSec=85.11240876820113, CurrSamplesPerSec=85.07484139314154, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:39,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=13040, skipped=252, lr=[4.0439310690981956e-07, 4.0439310690981956e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:39,777] [INFO] [timer.py:215:stop] epoch=14/micro_step=160/global_step=13040, 
RunningAvgSamplesPerSec=85.1123068271299, CurrSamplesPerSec=84.82343193755494, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:47,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=13050, skipped=252, lr=[4.002763947394002e-07, 4.002763947394002e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:47,305] [INFO] [timer.py:215:stop] epoch=14/micro_step=170/global_step=13050, RunningAvgSamplesPerSec=85.11228162522772, CurrSamplesPerSec=85.34391407219192, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:48:51,010] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:48:51,706] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:48:54,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=13060, skipped=254, lr=[3.9699753528190307e-07, 3.9699753528190307e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:48:54,722] [INFO] [timer.py:215:stop] epoch=14/micro_step=180/global_step=13060, RunningAvgSamplesPerSec=85.11322791590196, CurrSamplesPerSec=84.72052421317211, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:02,227] [INFO] [logging.py:96:log_dist] [Rank 0] step=13070, skipped=254, lr=[3.9291711449038386e-07, 3.9291711449038386e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:02,261] [INFO] [timer.py:215:stop] epoch=14/micro_step=190/global_step=13070, RunningAvgSamplesPerSec=85.11311382336434, CurrSamplesPerSec=85.13452459409453, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:09,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=13080, skipped=254, lr=[3.8885688163091044e-07, 3.8885688163091044e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:09,794] [INFO] [timer.py:215:stop] epoch=14/micro_step=200/global_step=13080, RunningAvgSamplesPerSec=85.11304600693917, 
CurrSamplesPerSec=84.8359778393702, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:17,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=13090, skipped=254, lr=[3.84816855197655e-07, 3.84816855197655e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:17,332] [INFO] [timer.py:215:stop] epoch=14/micro_step=210/global_step=13090, RunningAvgSamplesPerSec=85.11294355657336, CurrSamplesPerSec=85.09126481094577, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:24,835] [INFO] [logging.py:96:log_dist] [Rank 0] step=13100, skipped=254, lr=[3.807970535927492e-07, 3.807970535927492e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:24,867] [INFO] [timer.py:215:stop] epoch=14/micro_step=220/global_step=13100, RunningAvgSamplesPerSec=85.1128545069536, CurrSamplesPerSec=85.02278144873112, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:32,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=13110, skipped=254, lr=[3.7679749512620537e-07, 3.7679749512620537e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:32,400] [INFO] [timer.py:215:stop] epoch=14/micro_step=230/global_step=13110, RunningAvgSamplesPerSec=85.11279337477846, CurrSamplesPerSec=84.83246568515447, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:39,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=13120, skipped=254, lr=[3.728181980158254e-07, 3.728181980158254e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:39,935] [INFO] [timer.py:215:stop] epoch=14/micro_step=240/global_step=13120, RunningAvgSamplesPerSec=85.11271066921786, CurrSamplesPerSec=84.67028560145498, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:47,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=13130, skipped=254, lr=[3.688591803871233e-07, 3.688591803871233e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:47,475] [INFO] [timer.py:215:stop] epoch=14/micro_step=250/global_step=13130, RunningAvgSamplesPerSec=85.11258285111104, 
CurrSamplesPerSec=85.20386426569937, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:49:54,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=13140, skipped=254, lr=[3.6492046027324043e-07, 3.6492046027324043e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:49:55,007] [INFO] [timer.py:215:stop] epoch=14/micro_step=260/global_step=13140, RunningAvgSamplesPerSec=85.11254012518073, CurrSamplesPerSec=84.94389696691613, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:02,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=13150, skipped=254, lr=[3.6100205561486546e-07, 3.6100205561486546e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:02,535] [INFO] [timer.py:215:stop] epoch=14/micro_step=270/global_step=13150, RunningAvgSamplesPerSec=85.11251808799706, CurrSamplesPerSec=85.08829787874744, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:07,749] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:50:08,448] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:50:09,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=13160, skipped=256, lr=[3.5788197101319955e-07, 3.5788197101319955e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:09,953] [INFO] [timer.py:215:stop] epoch=14/micro_step=280/global_step=13160, RunningAvgSamplesPerSec=85.11345101315482, CurrSamplesPerSec=84.8972057201267, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:17,453] [INFO] [logging.py:96:log_dist] [Rank 0] step=13170, skipped=256, lr=[3.540001790898436e-07, 3.540001790898436e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:17,487] [INFO] [timer.py:215:stop] epoch=14/micro_step=290/global_step=13170, RunningAvgSamplesPerSec=85.11337767339134, CurrSamplesPerSec=84.98063052111192, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:24,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=13180, skipped=256, lr=[3.5013875236336836e-07, 3.5013875236336836e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:25,028] [INFO] [timer.py:215:stop] epoch=14/micro_step=300/global_step=13180, RunningAvgSamplesPerSec=85.11325294898035, CurrSamplesPerSec=85.10246010001705, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:32,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=13190, skipped=256, lr=[3.4629770842239534e-07, 3.4629770842239534e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:32,549] [INFO] [timer.py:215:stop] epoch=14/micro_step=310/global_step=13190, RunningAvgSamplesPerSec=85.11329186039269, CurrSamplesPerSec=85.169072227096, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:40,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=13200, skipped=256, lr=[3.424770647627e-07, 3.424770647627e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:40,077] [INFO] [timer.py:215:stop] epoch=14/micro_step=320/global_step=13200, RunningAvgSamplesPerSec=85.11326701794042, CurrSamplesPerSec=85.16842369438093, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:50:47,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=13210, skipped=256, lr=[3.3867683878713817e-07, 3.3867683878713817e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:47,614] [INFO] [timer.py:215:stop] epoch=14/micro_step=330/global_step=13210, RunningAvgSamplesPerSec=85.11317202412896, CurrSamplesPerSec=85.03905007074036, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:50:55,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=13220, skipped=256, lr=[3.3489704780556226e-07, 3.3489704780556226e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:50:55,147] [INFO] [timer.py:215:stop] epoch=14/micro_step=340/global_step=13220, RunningAvgSamplesPerSec=85.11310494107435, CurrSamplesPerSec=85.057750513321, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:02,639] [INFO] [logging.py:96:log_dist] [Rank 0] step=13230, skipped=256, lr=[3.311377090347465e-07, 3.311377090347465e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:02,673] [INFO] [timer.py:215:stop] epoch=14/micro_step=350/global_step=13230, RunningAvgSamplesPerSec=85.11310630196448, CurrSamplesPerSec=84.95752716367008, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:10,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=13240, skipped=256, lr=[3.2739883959830183e-07, 3.2739883959830183e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:10,201] [INFO] [timer.py:215:stop] epoch=14/micro_step=360/global_step=13240, RunningAvgSamplesPerSec=85.11308801487736, CurrSamplesPerSec=84.93040549861232, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:17,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=13250, skipped=256, lr=[3.2368045652660754e-07, 3.2368045652660754e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:17,745] [INFO] [timer.py:215:stop] epoch=14/micro_step=370/global_step=13250, RunningAvgSamplesPerSec=85.11292585619363, CurrSamplesPerSec=85.09504120573003, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:51:24,469] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:51:25,166] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:51:25,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=13260, skipped=258, lr=[3.207205116367155e-07, 3.207205116367155e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:25,168] [INFO] [timer.py:215:stop] epoch=14/micro_step=380/global_step=13260, RunningAvgSamplesPerSec=85.11380958880565, CurrSamplesPerSec=91.8861474256108, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:32,674] [INFO] [logging.py:96:log_dist] [Rank 0] step=13270, skipped=258, lr=[3.170390466402113e-07, 3.170390466402113e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:32,707] [INFO] [timer.py:215:stop] epoch=14/micro_step=390/global_step=13270, RunningAvgSamplesPerSec=85.11369763652748, CurrSamplesPerSec=84.72432125340083, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:40,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=13280, skipped=258, lr=[3.133781151968328e-07, 3.133781151968328e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:40,243] [INFO] [timer.py:215:stop] epoch=14/micro_step=400/global_step=13280, RunningAvgSamplesPerSec=85.11361131602699, CurrSamplesPerSec=84.5164139800147, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:47,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=13290, skipped=258, lr=[3.097377339819548e-07, 3.097377339819548e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:47,770] [INFO] [timer.py:215:stop] epoch=14/micro_step=410/global_step=13290, RunningAvgSamplesPerSec=85.11360553884256, CurrSamplesPerSec=84.98302495176014, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:51:55,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=13300, 
skipped=258, lr=[3.0611791957734217e-07, 3.0611791957734217e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:51:55,299] [INFO] [timer.py:215:stop] epoch=14/micro_step=420/global_step=13300, RunningAvgSamplesPerSec=85.1135696517518, CurrSamplesPerSec=84.89664187046496, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:02,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=13310, skipped=258, lr=[3.0251868847108435e-07, 3.0251868847108435e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:02,821] [INFO] [timer.py:215:stop] epoch=14/micro_step=430/global_step=13310, RunningAvgSamplesPerSec=85.11360562427174, CurrSamplesPerSec=85.40359513098366, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:10,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=13320, skipped=258, lr=[2.989400570575103e-07, 2.989400570575103e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:10,350] [INFO] [timer.py:215:stop] epoch=14/micro_step=440/global_step=13320, RunningAvgSamplesPerSec=85.11357157826762, CurrSamplesPerSec=85.35976295761131, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:17,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=13330, skipped=258, lr=[2.9538204163712096e-07, 2.9538204163712096e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:17,881] [INFO] [timer.py:215:stop] epoch=14/micro_step=450/global_step=13330, RunningAvgSamplesPerSec=85.11352493618159, CurrSamplesPerSec=85.01922688330404, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:25,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=13340, skipped=258, lr=[2.918446584165116e-07, 2.918446584165116e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:25,424] [INFO] [timer.py:215:stop] epoch=14/micro_step=460/global_step=13340, RunningAvgSamplesPerSec=85.11337234795, CurrSamplesPerSec=84.90324743655475, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:32,924] [INFO] [logging.py:96:log_dist] [Rank 0] step=13350, skipped=258, 
lr=[2.883279235082994e-07, 2.883279235082994e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:32,958] [INFO] [timer.py:215:stop] epoch=14/micro_step=470/global_step=13350, RunningAvgSamplesPerSec=85.11330855995546, CurrSamplesPerSec=84.8641392043299, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:40,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=13360, skipped=258, lr=[2.8483185293104695e-07, 2.8483185293104695e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:40,488] [INFO] [timer.py:215:stop] epoch=14/micro_step=480/global_step=13360, RunningAvgSamplesPerSec=85.11326723213989, CurrSamplesPerSec=84.87412085945513, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:41,186] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:52:41,882] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:52:47,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=13370, skipped=260, lr=[2.8204988549192515e-07, 2.8204988549192515e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:47,912] [INFO] [timer.py:215:stop] epoch=14/micro_step=490/global_step=13370, RunningAvgSamplesPerSec=85.11412285654623, CurrSamplesPerSec=84.81078256363918, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:52:55,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=13380, skipped=260, lr=[2.785910507766916e-07, 2.785910507766916e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:52:55,448] [INFO] [timer.py:215:stop] epoch=14/micro_step=500/global_step=13380, RunningAvgSamplesPerSec=85.11403203517234, CurrSamplesPerSec=84.782656593476, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:02,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=13390, skipped=260, lr=[2.751529247434222e-07, 2.751529247434222e-07], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-06-29 19:53:02,966] [INFO] [timer.py:215:stop] epoch=14/micro_step=510/global_step=13390, RunningAvgSamplesPerSec=85.11409878133588, CurrSamplesPerSec=85.10764060404493, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] step=13400, skipped=260, lr=[2.717355230526228e-07, 2.717355230526228e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:10,493] [INFO] [timer.py:215:stop] epoch=14/micro_step=520/global_step=13400, RunningAvgSamplesPerSec=85.11408706578389, CurrSamplesPerSec=85.23627588265366, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:17,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=13410, skipped=260, lr=[2.6833886127039926e-07, 2.6833886127039926e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:18,017] [INFO] [timer.py:215:stop] epoch=14/micro_step=530/global_step=13410, RunningAvgSamplesPerSec=85.11409767146425, CurrSamplesPerSec=85.16642411401656, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:25,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=13420, skipped=260, lr=[2.64962954868387e-07, 2.64962954868387e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:25,548] [INFO] [timer.py:215:stop] epoch=14/micro_step=540/global_step=13420, RunningAvgSamplesPerSec=85.11405480923776, CurrSamplesPerSec=85.21517036021193, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:33,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=13430, skipped=260, lr=[2.616078192236859e-07, 2.616078192236859e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:33,080] [INFO] [timer.py:215:stop] epoch=14/micro_step=550/global_step=13430, RunningAvgSamplesPerSec=85.11400432746385, CurrSamplesPerSec=84.84037514467418, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:40,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=13440, skipped=260, lr=[2.5827346961878084e-07, 2.5827346961878084e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 
19:53:40,615] [INFO] [timer.py:215:stop] epoch=14/micro_step=560/global_step=13440, RunningAvgSamplesPerSec=85.11392000685669, CurrSamplesPerSec=84.64692358218575, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:48,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=13450, skipped=260, lr=[2.549599212414806e-07, 2.549599212414806e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:48,157] [INFO] [timer.py:215:stop] epoch=14/micro_step=570/global_step=13450, RunningAvgSamplesPerSec=85.11378424728362, CurrSamplesPerSec=84.99209272666367, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:55,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=13460, skipped=260, lr=[2.516671891848453e-07, 2.516671891848453e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:53:55,693] [INFO] [timer.py:215:stop] epoch=14/micro_step=580/global_step=13460, RunningAvgSamplesPerSec=85.11370025605731, CurrSamplesPerSec=84.87803903241698, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:53:57,899] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:53:58,596] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:54:03,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=13470, skipped=262, lr=[2.4904800137242146e-07, 2.4904800137242146e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:03,120] [INFO] [timer.py:215:stop] epoch=14/micro_step=590/global_step=13470, RunningAvgSamplesPerSec=85.11452393997736, CurrSamplesPerSec=84.84517516286446, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:10,616] [INFO] [logging.py:96:log_dist] [Rank 0] step=13480, skipped=262, lr=[2.457927764247885e-07, 2.457927764247885e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:10,650] [INFO] [timer.py:215:stop] epoch=14/micro_step=600/global_step=13480, RunningAvgSamplesPerSec=85.11448543914497, CurrSamplesPerSec=85.37521042813957, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:18,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=13490, skipped=262, lr=[2.425584095537404e-07, 2.425584095537404e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:18,180] [INFO] [timer.py:215:stop] epoch=14/micro_step=610/global_step=13490, RunningAvgSamplesPerSec=85.11444216333274, CurrSamplesPerSec=85.05203711378358, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:25,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=13500, skipped=262, lr=[2.393449154916669e-07, 2.393449154916669e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:25,703] [INFO] [timer.py:215:stop] epoch=14/micro_step=620/global_step=13500, RunningAvgSamplesPerSec=85.11446284613085, CurrSamplesPerSec=85.38466084596625, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:33,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=13510, skipped=262, lr=[2.3615230887588423e-07, 2.3615230887588423e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:33,235] [INFO] [timer.py:215:stop] epoch=14/micro_step=630/global_step=13510, RunningAvgSamplesPerSec=85.11440675265331, CurrSamplesPerSec=84.93271648179471, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 19:54:40,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=13520, skipped=262, lr=[2.3298060424856818e-07, 2.3298060424856818e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:40,760] [INFO] [timer.py:215:stop] epoch=14/micro_step=640/global_step=13520, RunningAvgSamplesPerSec=85.11440806566443, CurrSamplesPerSec=85.39791668058494, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:48,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=13530, skipped=262, lr=[2.2982981605668658e-07, 2.2982981605668658e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:48,300] [INFO] [timer.py:215:stop] epoch=14/micro_step=650/global_step=13530, RunningAvgSamplesPerSec=85.1142887460768, CurrSamplesPerSec=85.20984151602696, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:54:55,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=13540, skipped=262, lr=[2.2669995865193137e-07, 2.2669995865193137e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:54:55,838] [INFO] [timer.py:215:stop] epoch=14/micro_step=660/global_step=13540, RunningAvgSamplesPerSec=85.11418999030222, CurrSamplesPerSec=84.91761674946412, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:03,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=13550, skipped=262, lr=[2.235910462906611e-07, 2.235910462906611e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:03,372] [INFO] [timer.py:215:stop] epoch=14/micro_step=670/global_step=13550, RunningAvgSamplesPerSec=85.11411782226976, CurrSamplesPerSec=85.0187691199661, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:10,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=13560, skipped=262, lr=[2.20503093133825e-07, 2.20503093133825e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:10,890] [INFO] [timer.py:215:stop] epoch=14/micro_step=680/global_step=13560, RunningAvgSamplesPerSec=85.11417631536848, CurrSamplesPerSec=85.18172058945326, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 19:55:14,591] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:55:15,286] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:55:18,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=13570, skipped=264, lr=[2.1804783069076385e-07, 2.1804783069076385e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:18,297] [INFO] [timer.py:215:stop] epoch=14/micro_step=690/global_step=13570, RunningAvgSamplesPerSec=85.1151614988386, CurrSamplesPerSec=85.17439597309048, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:25,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=13580, skipped=264, lr=[2.1499763948273365e-07, 2.1499763948273365e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:25,830] [INFO] [timer.py:215:stop] epoch=14/micro_step=700/global_step=13580, RunningAvgSamplesPerSec=85.11509422917379, CurrSamplesPerSec=84.6489522291678, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:33,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=13590, skipped=264, lr=[2.11968446621708e-07, 2.11968446621708e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:33,364] [INFO] [timer.py:215:stop] epoch=14/micro_step=710/global_step=13590, RunningAvgSamplesPerSec=85.11502129475487, CurrSamplesPerSec=85.0408281451763, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:40,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=13600, skipped=264, lr=[2.0896026590551998e-07, 2.0896026590551998e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:40,889] [INFO] [timer.py:215:stop] epoch=14/micro_step=720/global_step=13600, RunningAvgSamplesPerSec=85.11502873339643, CurrSamplesPerSec=85.15810254485675, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:48,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=13610, 
skipped=264, lr=[2.0597311103629377e-07, 2.0597311103629377e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:48,425] [INFO] [timer.py:215:stop] epoch=14/micro_step=730/global_step=13610, RunningAvgSamplesPerSec=85.11493654831119, CurrSamplesPerSec=85.09291019889584, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:55:55,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=13620, skipped=264, lr=[2.0300699562037947e-07, 2.0300699562037947e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:55:55,945] [INFO] [timer.py:215:stop] epoch=14/micro_step=740/global_step=13620, RunningAvgSamplesPerSec=85.11497569344577, CurrSamplesPerSec=85.12558837193447, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:03,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=13630, skipped=264, lr=[2.0006193316829777e-07, 2.0006193316829777e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:03,473] [INFO] [timer.py:215:stop] epoch=14/micro_step=750/global_step=13630, RunningAvgSamplesPerSec=85.11496192783436, CurrSamplesPerSec=84.89798438197276, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:10,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=13640, skipped=264, lr=[1.9713793709466995e-07, 1.9713793709466995e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:10,997] [INFO] [timer.py:215:stop] epoch=14/micro_step=760/global_step=13640, RunningAvgSamplesPerSec=85.11497292247373, CurrSamplesPerSec=85.31453864565486, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:18,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=13650, skipped=264, lr=[1.94235020718163e-07, 1.94235020718163e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:18,515] [INFO] [timer.py:215:stop] epoch=14/micro_step=770/global_step=13650, RunningAvgSamplesPerSec=85.11502925162014, CurrSamplesPerSec=85.2347602660345, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:26,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=13660, skipped=264, 
lr=[1.9135319726142672e-07, 1.9135319726142672e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:26,041] [INFO] [timer.py:215:stop] epoch=14/micro_step=780/global_step=13660, RunningAvgSamplesPerSec=85.11501400687536, CurrSamplesPerSec=85.26873925347613, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:31,265] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:56:31,963] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:56:33,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=13670, skipped=266, lr=[1.8906293422255878e-07, 1.8906293422255878e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:33,465] [INFO] [timer.py:215:stop] epoch=14/micro_step=790/global_step=13670, RunningAvgSamplesPerSec=85.11586713365303, CurrSamplesPerSec=85.33987137762769, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:40,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=13680, skipped=266, lr=[1.862191110357683e-07, 1.862191110357683e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:40,987] [INFO] [timer.py:215:stop] epoch=14/micro_step=800/global_step=13680, RunningAvgSamplesPerSec=85.11588892418516, CurrSamplesPerSec=85.13495660403329, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:48,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=13690, skipped=266, lr=[1.8339641728084565e-07, 1.8339641728084565e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:56:48,524] [INFO] [timer.py:215:stop] epoch=14/micro_step=810/global_step=13690, RunningAvgSamplesPerSec=85.11579390299005, CurrSamplesPerSec=84.84101869100392, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:56:56,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=13700, skipped=266, lr=[1.8059486581502947e-07, 1.8059486581502947e-07], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-06-29 19:56:56,057] [INFO] [timer.py:215:stop] epoch=14/micro_step=820/global_step=13700, RunningAvgSamplesPerSec=85.11572957742999, CurrSamplesPerSec=85.00261592347229, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:03,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=13710, skipped=266, lr=[1.778144693992566e-07, 1.778144693992566e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:03,589] [INFO] [timer.py:215:stop] epoch=14/micro_step=830/global_step=13710, RunningAvgSamplesPerSec=85.11567466619172, CurrSamplesPerSec=85.03495539422308, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:11,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=13720, skipped=266, lr=[1.7505524069810422e-07, 1.7505524069810422e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:11,113] [INFO] [timer.py:215:stop] epoch=14/micro_step=840/global_step=13720, RunningAvgSamplesPerSec=85.11568554949878, CurrSamplesPerSec=85.37371701905707, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:18,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=13730, skipped=266, lr=[1.7231719227973094e-07, 1.7231719227973094e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:18,638] [INFO] [timer.py:215:stop] epoch=14/micro_step=850/global_step=13730, RunningAvgSamplesPerSec=85.11568278310146, CurrSamplesPerSec=85.22252904771005, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:26,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=13740, skipped=266, lr=[1.6960033661581946e-07, 1.6960033661581946e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:26,168] [INFO] [timer.py:215:stop] epoch=14/micro_step=860/global_step=13740, RunningAvgSamplesPerSec=85.11564453960784, CurrSamplesPerSec=85.34963961709654, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:33,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=13750, skipped=266, lr=[1.6690468608152353e-07, 1.6690468608152353e-07], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-06-29 19:57:33,699] [INFO] [timer.py:215:stop] epoch=14/micro_step=870/global_step=13750, RunningAvgSamplesPerSec=85.1155944456777, CurrSamplesPerSec=85.17190967385751, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:41,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=13760, skipped=266, lr=[1.6423025295540422e-07, 1.6423025295540422e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:41,229] [INFO] [timer.py:215:stop] epoch=14/micro_step=880/global_step=13760, RunningAvgSamplesPerSec=85.11555190245657, CurrSamplesPerSec=84.83844456426169, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:47,945] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:57:48,641] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:57:48,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=13770, skipped=268, lr=[1.6210599117789524e-07, 1.6210599117789524e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:48,642] [INFO] [timer.py:215:stop] epoch=14/micro_step=890/global_step=13770, RunningAvgSamplesPerSec=85.11647746342341, CurrSamplesPerSec=92.04014669618145, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:57:56,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=13780, skipped=268, lr=[1.5946978001995052e-07, 1.5946978001995052e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:57:56,173] [INFO] [timer.py:215:stop] epoch=14/micro_step=900/global_step=13780, RunningAvgSamplesPerSec=85.11642371538908, CurrSamplesPerSec=84.86354896492159, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:58:03,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=13790, skipped=268, lr=[1.568548201358361e-07, 1.568548201358361e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:03,701] [INFO] [timer.py:215:stop] 
epoch=14/micro_step=910/global_step=13790, RunningAvgSamplesPerSec=85.11640403607716, CurrSamplesPerSec=85.29659243178524, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:58:11,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=13800, skipped=268, lr=[1.542611234365716e-07, 1.542611234365716e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:11,219] [INFO] [timer.py:215:stop] epoch=14/micro_step=920/global_step=13800, RunningAvgSamplesPerSec=85.11646328676642, CurrSamplesPerSec=84.8854469779523, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 15/16 *****
ppl: 1.774248719215393
Beginning of Epoch 16/16, Total Micro Batches 920
[2023-06-29 19:58:36,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=13810, skipped=268, lr=[1.5168870173632736e-07, 1.5168870173632736e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:36,633] [INFO] [timer.py:215:stop] epoch=15/micro_step=10/global_step=13810, RunningAvgSamplesPerSec=85.11618851219333, CurrSamplesPerSec=84.934490112324, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:58:44,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=13820, skipped=268, lr=[1.4913756675236278e-07, 1.4913756675236278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:44,168] [INFO] [timer.py:215:stop] epoch=15/micro_step=20/global_step=13820, RunningAvgSamplesPerSec=85.1161035485245, CurrSamplesPerSec=85.29038620893003, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:58:51,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=13830, skipped=268, lr=[1.4660773010498093e-07, 1.4660773010498093e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:51,702] [INFO] [timer.py:215:stop] epoch=15/micro_step=30/global_step=13830, RunningAvgSamplesPerSec=85.11603163432875, CurrSamplesPerSec=84.63550091198157, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:58:59,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=13840, skipped=268, lr=[1.440992033174695e-07, 
1.440992033174695e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:58:59,233] [INFO] [timer.py:215:stop] epoch=15/micro_step=40/global_step=13840, RunningAvgSamplesPerSec=85.11597740059723, CurrSamplesPerSec=85.06904482045799, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:06,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=13850, skipped=268, lr=[1.4161199781605266e-07, 1.4161199781605266e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:06,786] [INFO] [timer.py:215:stop] epoch=15/micro_step=50/global_step=13850, RunningAvgSamplesPerSec=85.11575458579416, CurrSamplesPerSec=84.86472945194869, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:14,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=13860, skipped=268, lr=[1.391461249298353e-07, 1.391461249298353e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:14,321] [INFO] [timer.py:215:stop] epoch=15/micro_step=60/global_step=13860, RunningAvgSamplesPerSec=85.11566778007709, CurrSamplesPerSec=84.9469613637313, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:21,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=13870, skipped=268, lr=[1.367015958907538e-07, 1.367015958907538e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:21,858] [INFO] [timer.py:215:stop] epoch=15/micro_step=70/global_step=13870, RunningAvgSamplesPerSec=85.11557476907299, CurrSamplesPerSec=84.9771870821346, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:22,556] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 19:59:23,254] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 19:59:29,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=13880, skipped=270, lr=[1.3476134771522473e-07, 1.3476134771522473e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:29,275] [INFO] [timer.py:215:stop] epoch=15/micro_step=80/global_step=13880, RunningAvgSamplesPerSec=85.1164594194291, CurrSamplesPerSec=84.98377828368739, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:36,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=13890, skipped=270, lr=[1.3235526559511019e-07, 1.3235526559511019e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:36,813] [INFO] [timer.py:215:stop] epoch=15/micro_step=90/global_step=13890, RunningAvgSamplesPerSec=85.11634691044416, CurrSamplesPerSec=84.9703279377127, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:44,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=13900, skipped=270, lr=[1.299705582541815e-07, 1.299705582541815e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:44,361] [INFO] [timer.py:215:stop] epoch=15/micro_step=100/global_step=13900, RunningAvgSamplesPerSec=85.11615895485099, CurrSamplesPerSec=84.94128971564564, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:51,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=13910, skipped=270, lr=[1.2760723655467044e-07, 1.2760723655467044e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:51,888] [INFO] [timer.py:215:stop] epoch=15/micro_step=110/global_step=13910, RunningAvgSamplesPerSec=85.11614297934595, CurrSamplesPerSec=85.23443549805708, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 19:59:59,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=13920, skipped=270, lr=[1.2526531126139658e-07, 1.2526531126139658e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 19:59:59,416] [INFO] [timer.py:215:stop] epoch=15/micro_step=120/global_step=13920, RunningAvgSamplesPerSec=85.11612024954357, CurrSamplesPerSec=85.12809896388305, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 20:00:06,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=13930, skipped=270, lr=[1.2294479304172215e-07, 1.2294479304172215e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:06,947] [INFO] [timer.py:215:stop] epoch=15/micro_step=130/global_step=13930, RunningAvgSamplesPerSec=85.11607829904455, CurrSamplesPerSec=85.11660004730894, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:14,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=13940, skipped=270, lr=[1.206456924654996e-07, 1.206456924654996e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:14,487] [INFO] [timer.py:215:stop] epoch=15/micro_step=140/global_step=13940, RunningAvgSamplesPerSec=85.1159546428762, CurrSamplesPerSec=84.94064464724703, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:21,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=13950, skipped=270, lr=[1.183680200050256e-07, 1.183680200050256e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:22,015] [INFO] [timer.py:215:stop] epoch=15/micro_step=150/global_step=13950, RunningAvgSamplesPerSec=85.11593367213959, CurrSamplesPerSec=85.17247718589738, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:29,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=13960, skipped=270, lr=[1.1611178603499272e-07, 1.1611178603499272e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:29,547] [INFO] [timer.py:215:stop] epoch=15/micro_step=160/global_step=13960, RunningAvgSamplesPerSec=85.11587516040503, CurrSamplesPerSec=85.17028825255811, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:30,253] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 32768, reducing to 16384 [2023-06-29 20:00:36,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=13970, skipped=271, lr=[1.1409951386728322e-07, 1.1409951386728322e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:37,027] [INFO] [timer.py:215:stop] epoch=15/micro_step=170/global_step=13970, RunningAvgSamplesPerSec=85.11624747446125, CurrSamplesPerSec=85.12434662850319, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:44,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=13980, skipped=271, lr=[1.1188404126158961e-07, 1.1188404126158961e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:44,550] [INFO] [timer.py:215:stop] epoch=15/micro_step=180/global_step=13980, RunningAvgSamplesPerSec=85.11626956225463, CurrSamplesPerSec=85.07964101387337, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:52,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=13990, skipped=271, lr=[1.0969003668055979e-07, 1.0969003668055979e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:52,077] [INFO] [timer.py:215:stop] epoch=15/micro_step=190/global_step=13990, RunningAvgSamplesPerSec=85.11624654692835, CurrSamplesPerSec=85.1931289581882, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:00:59,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=14000, skipped=271, lr=[1.0751751011778247e-07, 1.0751751011778247e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:00:59,611] [INFO] [timer.py:215:stop] epoch=15/micro_step=200/global_step=14000, RunningAvgSamplesPerSec=85.11617575653858, CurrSamplesPerSec=84.85753972192911, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:07,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=14010, skipped=271, lr=[1.0536647146901515e-07, 1.0536647146901515e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:07,137] [INFO] [timer.py:215:stop] epoch=15/micro_step=210/global_step=14010, RunningAvgSamplesPerSec=85.1161675358247, CurrSamplesPerSec=85.27597174447018, 
MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:14,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=14020, skipped=271, lr=[1.0323693053214024e-07, 1.0323693053214024e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:14,665] [INFO] [timer.py:215:stop] epoch=15/micro_step=220/global_step=14020, RunningAvgSamplesPerSec=85.11614563219419, CurrSamplesPerSec=85.2683329701543, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:22,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=14030, skipped=271, lr=[1.0112889700711685e-07, 1.0112889700711685e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:22,200] [INFO] [timer.py:215:stop] epoch=15/micro_step=230/global_step=14030, RunningAvgSamplesPerSec=85.11607436884069, CurrSamplesPerSec=84.95180033520515, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:29,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=14040, skipped=271, lr=[9.904238049594058e-08, 9.904238049594058e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:29,736] [INFO] [timer.py:215:stop] epoch=15/micro_step=240/global_step=14040, RunningAvgSamplesPerSec=85.11597929448267, CurrSamplesPerSec=84.80571851190045, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:37,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=14050, skipped=271, lr=[9.697739050259745e-08, 9.697739050259745e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:37,264] [INFO] [timer.py:215:stop] epoch=15/micro_step=250/global_step=14050, RunningAvgSamplesPerSec=85.115955994436, CurrSamplesPerSec=85.25519856215963, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:44,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=14060, skipped=271, lr=[9.493393643302004e-08, 9.493393643302004e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:44,787] [INFO] [timer.py:215:stop] epoch=15/micro_step=260/global_step=14060, RunningAvgSamplesPerSec=85.11597800245991, CurrSamplesPerSec=85.23411073255456, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 20:01:52,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=14070, skipped=271, lr=[9.291202759504828e-08, 9.291202759504828e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:52,307] [INFO] [timer.py:215:stop] epoch=15/micro_step=270/global_step=14070, RunningAvgSamplesPerSec=85.11602451809054, CurrSamplesPerSec=85.10348535804685, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:01:59,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=14080, skipped=271, lr=[9.091167319838243e-08, 9.091167319838243e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:01:59,832] [INFO] [timer.py:215:stop] epoch=15/micro_step=280/global_step=14080, RunningAvgSamplesPerSec=85.116022250649, CurrSamplesPerSec=84.78354026859945, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:07,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=14090, skipped=271, lr=[8.89328823545444e-08, 8.89328823545444e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:07,365] [INFO] [timer.py:215:stop] epoch=15/micro_step=290/global_step=14090, RunningAvgSamplesPerSec=85.11596243422487, CurrSamplesPerSec=84.99322297072237, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:14,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=14100, skipped=271, lr=[8.697566407683387e-08, 8.697566407683387e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:14,894] [INFO] [timer.py:215:stop] epoch=15/micro_step=300/global_step=14100, RunningAvgSamplesPerSec=85.11593342443295, CurrSamplesPerSec=85.06505508058723, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:22,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=14110, skipped=271, lr=[8.504002728029084e-08, 8.504002728029084e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:22,430] [INFO] [timer.py:215:stop] epoch=15/micro_step=310/global_step=14110, RunningAvgSamplesPerSec=85.11584718045464, CurrSamplesPerSec=84.83206354864684, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 20:02:29,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=14120, skipped=271, lr=[8.312598078165002e-08, 8.312598078165002e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:29,962] [INFO] [timer.py:215:stop] epoch=15/micro_step=320/global_step=14120, RunningAvgSamplesPerSec=85.11579585075359, CurrSamplesPerSec=85.38026125326375, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:37,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=14130, skipped=271, lr=[8.123353329930495e-08, 8.123353329930495e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:37,498] [INFO] [timer.py:215:stop] epoch=15/micro_step=330/global_step=14130, RunningAvgSamplesPerSec=85.11570626956284, CurrSamplesPerSec=85.20705563438177, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:44,987] [INFO] [logging.py:96:log_dist] [Rank 0] step=14140, skipped=271, lr=[7.936269345326577e-08, 7.936269345326577e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:45,020] [INFO] [timer.py:215:stop] epoch=15/micro_step=340/global_step=14140, RunningAvgSamplesPerSec=85.11573455651471, CurrSamplesPerSec=85.1487292183973, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:02:52,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=14150, skipped=271, lr=[7.751346976512104e-08, 7.751346976512104e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:02:52,549] [INFO] [timer.py:215:stop] epoch=15/micro_step=350/global_step=14150, RunningAvgSamplesPerSec=85.11570408982274, CurrSamplesPerSec=85.21895776123431, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:00,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=14160, skipped=271, lr=[7.568587065800038e-08, 7.568587065800038e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:00,086] [INFO] [timer.py:215:stop] epoch=15/micro_step=360/global_step=14160, RunningAvgSamplesPerSec=85.11561132460403, CurrSamplesPerSec=84.7677707757816, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:01,533] 
[INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:03:02,231] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:03:07,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=14170, skipped=273, lr=[7.423936666810114e-08, 7.423936666810114e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:07,500] [INFO] [timer.py:215:stop] epoch=15/micro_step=370/global_step=14170, RunningAvgSamplesPerSec=85.11649924780508, CurrSamplesPerSec=84.99284621936276, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:14,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=14180, skipped=273, lr=[7.245071271867132e-08, 7.245071271867132e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:15,028] [INFO] [timer.py:215:stop] epoch=15/micro_step=380/global_step=14180, RunningAvgSamplesPerSec=85.11647392886918, CurrSamplesPerSec=85.05085141006079, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:22,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=14190, skipped=273, lr=[7.068370641088817e-08, 7.068370641088817e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:22,568] [INFO] [timer.py:215:stop] epoch=15/micro_step=390/global_step=14190, RunningAvgSamplesPerSec=85.11636979014212, CurrSamplesPerSec=84.76597733409584, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:30,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=14200, skipped=273, lr=[6.893835579338344e-08, 6.893835579338344e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:30,107] [INFO] [timer.py:215:stop] epoch=15/micro_step=400/global_step=14200, RunningAvgSamplesPerSec=85.11625875745064, CurrSamplesPerSec=84.70074232729674, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:37,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=14210, skipped=273, 
lr=[6.721466881614827e-08, 6.721466881614827e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:37,642] [INFO] [timer.py:215:stop] epoch=15/micro_step=410/global_step=14210, RunningAvgSamplesPerSec=85.11618273433115, CurrSamplesPerSec=84.93234026665992, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:45,148] [INFO] [logging.py:96:log_dist] [Rank 0] step=14220, skipped=273, lr=[6.551265333049732e-08, 6.551265333049732e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:45,181] [INFO] [timer.py:215:stop] epoch=15/micro_step=420/global_step=14220, RunningAvgSamplesPerSec=85.11606409141476, CurrSamplesPerSec=85.0024813392899, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:03:52,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=14230, skipped=273, lr=[6.38323170890318e-08, 6.38323170890318e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:03:52,710] [INFO] [timer.py:215:stop] epoch=15/micro_step=430/global_step=14230, RunningAvgSamplesPerSec=85.11603613445033, CurrSamplesPerSec=85.3001973339265, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:00,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=14240, skipped=273, lr=[6.21736677456052e-08, 6.21736677456052e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:00,246] [INFO] [timer.py:215:stop] epoch=15/micro_step=440/global_step=14240, RunningAvgSamplesPerSec=85.11595242340064, CurrSamplesPerSec=84.82270824874038, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:07,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=14250, skipped=273, lr=[6.053671285528843e-08, 6.053671285528843e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:07,776] [INFO] [timer.py:215:stop] epoch=15/micro_step=450/global_step=14250, RunningAvgSamplesPerSec=85.1159145789173, CurrSamplesPerSec=84.96984380465214, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:15,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=14260, skipped=273, lr=[5.892145987433506e-08, 
5.892145987433506e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:15,308] [INFO] [timer.py:215:stop] epoch=15/micro_step=460/global_step=14260, RunningAvgSamplesPerSec=85.11586309447424, CurrSamplesPerSec=84.94701512705242, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:18,260] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:04:18,957] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:04:22,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=14270, skipped=275, lr=[5.764488781181105e-08, 5.764488781181105e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:22,720] [INFO] [timer.py:215:stop] epoch=15/micro_step=470/global_step=14270, RunningAvgSamplesPerSec=85.11676187070789, CurrSamplesPerSec=85.27410255620153, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:30,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=14280, skipped=275, lr=[5.606871674191729e-08, 5.606871674191729e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:30,251] [INFO] [timer.py:215:stop] epoch=15/micro_step=480/global_step=14280, RunningAvgSamplesPerSec=85.11672109965752, CurrSamplesPerSec=84.88150128080144, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:37,750] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=275, lr=[5.451426793290241e-08, 5.451426793290241e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:37,783] [INFO] [timer.py:215:stop] epoch=15/micro_step=490/global_step=14290, RunningAvgSamplesPerSec=85.1166603478516, CurrSamplesPerSec=85.03193851647102, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:45,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=14300, skipped=275, lr=[5.298154846520809e-08, 5.298154846520809e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:45,318] 
[INFO] [timer.py:215:stop] epoch=15/micro_step=500/global_step=14300, RunningAvgSamplesPerSec=85.11658640551927, CurrSamplesPerSec=84.75299722031733, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:04:52,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=14310, skipped=275, lr=[5.1470565320301137e-08, 5.1470565320301137e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:04:52,846] [INFO] [timer.py:215:stop] epoch=15/micro_step=510/global_step=14310, RunningAvgSamplesPerSec=85.11656601100682, CurrSamplesPerSec=84.73178262666028, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:00,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=14320, skipped=275, lr=[4.998132538063975e-08, 4.998132538063975e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:00,372] [INFO] [timer.py:215:stop] epoch=15/micro_step=520/global_step=14320, RunningAvgSamplesPerSec=85.11656106975057, CurrSamplesPerSec=85.0833624408473, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:07,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=14330, skipped=275, lr=[4.851383542964191e-08, 4.851383542964191e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:07,903] [INFO] [timer.py:215:stop] epoch=15/micro_step=530/global_step=14330, RunningAvgSamplesPerSec=85.11651035304789, CurrSamplesPerSec=84.85016346671446, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:15,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=14340, skipped=275, lr=[4.706810215165701e-08, 4.706810215165701e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:15,429] [INFO] [timer.py:215:stop] epoch=15/micro_step=540/global_step=14340, RunningAvgSamplesPerSec=85.11650715350153, CurrSamplesPerSec=85.41318772159869, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:22,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=14350, skipped=275, lr=[4.5644132131933135e-08, 4.5644132131933135e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:22,954] [INFO] [timer.py:215:stop] 
epoch=15/micro_step=550/global_step=14350, RunningAvgSamplesPerSec=85.11650845784541, CurrSamplesPerSec=85.24688670838681, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:30,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=14360, skipped=275, lr=[4.4241931856588175e-08, 4.4241931856588175e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:30,491] [INFO] [timer.py:215:stop] epoch=15/micro_step=560/global_step=14360, RunningAvgSamplesPerSec=85.116414546885, CurrSamplesPerSec=84.95360164871768, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:34,957] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:05:35,653] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:05:37,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=14370, skipped=277, lr=[4.3135850147631915e-08, 4.3135850147631915e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:37,912] [INFO] [timer.py:215:stop] epoch=15/micro_step=570/global_step=14370, RunningAvgSamplesPerSec=85.11722965085833, CurrSamplesPerSec=85.18653227759144, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:45,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=14380, skipped=277, lr=[4.1772851440644845e-08, 4.1772851440644845e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:45,451] [INFO] [timer.py:215:stop] epoch=15/micro_step=580/global_step=14380, RunningAvgSamplesPerSec=85.11711980589102, CurrSamplesPerSec=84.95026793459186, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:05:52,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=14390, skipped=277, lr=[4.043164011154094e-08, 4.043164011154094e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:05:52,980] [INFO] [timer.py:215:stop] epoch=15/micro_step=590/global_step=14390, 
RunningAvgSamplesPerSec=85.1170891575004, CurrSamplesPerSec=84.86540019784557, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:00,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=14400, skipped=277, lr=[3.911222226947448e-08, 3.911222226947448e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:00,520] [INFO] [timer.py:215:stop] epoch=15/micro_step=600/global_step=14400, RunningAvgSamplesPerSec=85.11697430502495, CurrSamplesPerSec=85.13447059316046, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:08,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=14410, skipped=277, lr=[3.781460392433294e-08, 3.781460392433294e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:08,046] [INFO] [timer.py:215:stop] epoch=15/micro_step=610/global_step=14410, RunningAvgSamplesPerSec=85.11696318246585, CurrSamplesPerSec=85.08800119690706, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:15,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=14420, skipped=277, lr=[3.653879098670754e-08, 3.653879098670754e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:15,566] [INFO] [timer.py:215:stop] epoch=15/micro_step=620/global_step=14420, RunningAvgSamplesPerSec=85.11700881326398, CurrSamplesPerSec=85.02488201346789, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:23,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=14430, skipped=277, lr=[3.528478926786696e-08, 3.528478926786696e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:23,100] [INFO] [timer.py:215:stop] epoch=15/micro_step=630/global_step=14430, RunningAvgSamplesPerSec=85.11693673916966, CurrSamplesPerSec=85.13687370103334, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:30,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=14440, skipped=277, lr=[3.405260447973165e-08, 3.405260447973165e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:30,632] [INFO] [timer.py:215:stop] epoch=15/micro_step=640/global_step=14440, 
RunningAvgSamplesPerSec=85.1168824058402, CurrSamplesPerSec=85.25335736034064, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:38,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=14450, skipped=277, lr=[3.284224223484543e-08, 3.284224223484543e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:38,169] [INFO] [timer.py:215:stop] epoch=15/micro_step=650/global_step=14450, RunningAvgSamplesPerSec=85.1167879329165, CurrSamplesPerSec=84.65583967721497, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:45,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=14460, skipped=277, lr=[3.1653708046352495e-08, 3.1653708046352495e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:45,699] [INFO] [timer.py:215:stop] epoch=15/micro_step=660/global_step=14460, RunningAvgSamplesPerSec=85.1167496544121, CurrSamplesPerSec=84.92264044156471, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:06:51,676] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:06:52,373] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:06:53,094] [INFO] [logging.py:96:log_dist] [Rank 0] step=14470, skipped=279, lr=[3.07186005375209e-08, 3.07186005375209e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:06:53,128] [INFO] [timer.py:215:stop] epoch=15/micro_step=670/global_step=14470, RunningAvgSamplesPerSec=85.11750829447121, CurrSamplesPerSec=85.03040322337964, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:00,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=14480, skipped=279, lr=[2.956937042627529e-08, 2.956937042627529e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:00,664] [INFO] [timer.py:215:stop] epoch=15/micro_step=680/global_step=14480, RunningAvgSamplesPerSec=85.11741980998299, CurrSamplesPerSec=85.06238647699372, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:08,160] [INFO] [logging.py:96:log_dist] [Rank 0] step=14490, skipped=279, lr=[2.8441983279202135e-08, 2.8441983279202135e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:08,193] [INFO] [timer.py:215:stop] epoch=15/micro_step=690/global_step=14490, RunningAvgSamplesPerSec=85.11738764945355, CurrSamplesPerSec=85.06063445457636, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:15,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=14500, skipped=279, lr=[2.7336444231497413e-08, 2.7336444231497413e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:15,724] [INFO] [timer.py:215:stop] epoch=15/micro_step=700/global_step=14500, RunningAvgSamplesPerSec=85.11734684891273, CurrSamplesPerSec=85.0483184264971, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:23,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=14510, skipped=279, lr=[2.6252758318841213e-08, 2.6252758318841213e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:23,260] [INFO] [timer.py:215:stop] epoch=15/micro_step=710/global_step=14510, RunningAvgSamplesPerSec=85.11726082070241, CurrSamplesPerSec=85.12480552946407, MemAllocated=5.0GB, 
MaxMemAllocated=23.99GB [2023-06-29 20:07:30,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=14520, skipped=279, lr=[2.5190930477372e-08, 2.5190930477372e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:30,793] [INFO] [timer.py:215:stop] epoch=15/micro_step=720/global_step=14520, RunningAvgSamplesPerSec=85.11720173893079, CurrSamplesPerSec=85.11719381199829, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:38,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=14530, skipped=279, lr=[2.4150965543665742e-08, 2.4150965543665742e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:38,316] [INFO] [timer.py:215:stop] epoch=15/micro_step=730/global_step=14530, RunningAvgSamplesPerSec=85.11721832139222, CurrSamplesPerSec=85.13563162834008, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:45,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=14540, skipped=279, lr=[2.3132868254715004e-08, 2.3132868254715004e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:45,833] [INFO] [timer.py:215:stop] epoch=15/micro_step=740/global_step=14540, RunningAvgSamplesPerSec=85.11728012165302, CurrSamplesPerSec=85.34236749279502, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:07:53,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=14550, skipped=279, lr=[2.213664324790646e-08, 2.213664324790646e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:07:53,361] [INFO] [timer.py:215:stop] epoch=15/micro_step=750/global_step=14550, RunningAvgSamplesPerSec=85.11725699654684, CurrSamplesPerSec=84.34968696699234, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:00,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=14560, skipped=279, lr=[2.1162295060997317e-08, 2.1162295060997317e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:00,902] [INFO] [timer.py:215:stop] epoch=15/micro_step=760/global_step=14560, RunningAvgSamplesPerSec=85.11712799758442, CurrSamplesPerSec=84.90469757645553, MemAllocated=5.0GB, MaxMemAllocated=23.99GB 
[2023-06-29 20:08:08,387] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:08:08,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=14570, skipped=280, lr=[2.0304090043951997e-08, 2.0304090043951997e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:08,388] [INFO] [timer.py:215:stop] epoch=15/micro_step=770/global_step=14570, RunningAvgSamplesPerSec=85.11743473391668, CurrSamplesPerSec=91.80108334225346, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:09,086] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:08:15,833] [INFO] [logging.py:96:log_dist] [Rank 0] step=14580, skipped=281, lr=[1.946361201353225e-08, 1.946361201353225e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:15,867] [INFO] [timer.py:215:stop] epoch=15/micro_step=780/global_step=14580, RunningAvgSamplesPerSec=85.11779202465549, CurrSamplesPerSec=85.1944808631308, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:23,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=14590, skipped=281, lr=[1.8550542214940644e-08, 1.8550542214940644e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:23,394] [INFO] [timer.py:215:stop] epoch=15/micro_step=790/global_step=14590, RunningAvgSamplesPerSec=85.11777974592161, CurrSamplesPerSec=85.02838318535576, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:30,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=14600, skipped=281, lr=[1.765936557077271e-08, 1.765936557077271e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:30,922] [INFO] [timer.py:215:stop] epoch=15/micro_step=800/global_step=14600, RunningAvgSamplesPerSec=85.11775353056733, CurrSamplesPerSec=85.01744971204302, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:38,428] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=14610, skipped=281, lr=[1.6790086140297347e-08, 1.6790086140297347e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:38,462] [INFO] [timer.py:215:stop] epoch=15/micro_step=810/global_step=14610, RunningAvgSamplesPerSec=85.11763952155785, CurrSamplesPerSec=84.87055187975773, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:45,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=14620, skipped=281, lr=[1.594270788304202e-08, 1.594270788304202e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:45,983] [INFO] [timer.py:215:stop] epoch=15/micro_step=820/global_step=14620, RunningAvgSamplesPerSec=85.1176655676015, CurrSamplesPerSec=85.0745178426887, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:08:53,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=14630, skipped=281, lr=[1.51172346587751e-08, 1.51172346587751e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:08:53,507] [INFO] [timer.py:215:stop] epoch=15/micro_step=830/global_step=14630, RunningAvgSamplesPerSec=85.1176696880444, CurrSamplesPerSec=85.09166940864029, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:01,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=14640, skipped=281, lr=[1.4313670227489249e-08, 1.4313670227489249e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:01,036] [INFO] [timer.py:215:stop] epoch=15/micro_step=840/global_step=14640, RunningAvgSamplesPerSec=85.11764291339894, CurrSamplesPerSec=85.22951015379927, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:08,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=14650, skipped=281, lr=[1.3532018249383205e-08, 1.3532018249383205e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:08,576] [INFO] [timer.py:215:stop] epoch=15/micro_step=850/global_step=14650, RunningAvgSamplesPerSec=85.11753442303933, CurrSamplesPerSec=85.10086830097508, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:16,082] [INFO] [logging.py:96:log_dist] [Rank 0] step=14660, skipped=281, 
lr=[1.2772282284845187e-08, 1.2772282284845187e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:16,116] [INFO] [timer.py:215:stop] epoch=15/micro_step=860/global_step=14660, RunningAvgSamplesPerSec=85.11741755949191, CurrSamplesPerSec=85.05333064645934, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:23,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=14670, skipped=281, lr=[1.2034465794436808e-08, 1.2034465794436808e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:23,643] [INFO] [timer.py:215:stop] epoch=15/micro_step=870/global_step=14670, RunningAvgSamplesPerSec=85.11739585278514, CurrSamplesPerSec=84.94314434186936, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:25,091] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-06-29 20:09:25,789] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-06-29 20:09:31,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=14680, skipped=283, lr=[1.1459996885243141e-08, 1.1459996885243141e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:31,066] [INFO] [timer.py:215:stop] epoch=15/micro_step=880/global_step=14680, RunningAvgSamplesPerSec=85.1181874057722, CurrSamplesPerSec=84.92390317422856, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:38,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=14690, skipped=283, lr=[1.076164385017997e-08, 1.076164385017997e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-06-29 20:09:38,595] [INFO] [timer.py:215:stop] epoch=15/micro_step=890/global_step=14690, RunningAvgSamplesPerSec=85.11815849883044, CurrSamplesPerSec=84.76086510340785, MemAllocated=5.0GB, MaxMemAllocated=23.99GB [2023-06-29 20:09:46,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=14700, skipped=283, lr=[1.0085219447612204e-08, 1.0085219447612204e-08], mom=[(0.9, 0.95), 
(0.9, 0.95)]
[2023-06-29 20:09:46,133] [INFO] [timer.py:215:stop] epoch=15/micro_step=900/global_step=14700, RunningAvgSamplesPerSec=85.11805498163017, CurrSamplesPerSec=84.81817877335746, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 20:09:53,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=14710, skipped=283, lr=[9.430726758620686e-09, 9.430726758620686e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 20:09:53,648] [INFO] [timer.py:215:stop] epoch=15/micro_step=910/global_step=14710, RunningAvgSamplesPerSec=85.11813627649522, CurrSamplesPerSec=85.4161229956989, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
[2023-06-29 20:10:01,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=14720, skipped=283, lr=[8.798168764390025e-09, 8.798168764390025e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-29 20:10:01,173] [INFO] [timer.py:215:stop] epoch=15/micro_step=920/global_step=14720, RunningAvgSamplesPerSec=85.11813006501022, CurrSamplesPerSec=84.78276370463111, MemAllocated=5.0GB, MaxMemAllocated=23.99GB
***** Evaluating perplexity, Epoch 16/16 *****
ppl: 1.778785228729248
saving the final model ...
[2023-06-29 20:10:23,255] [INFO] [launch.py:347:main] Process 888724 exits successfully.
[2023-06-29 20:10:24,257] [INFO] [launch.py:347:main] Process 888727 exits successfully.
[2023-06-29 20:10:24,257] [INFO] [launch.py:347:main] Process 888729 exits successfully.
[2023-06-29 20:10:24,257] [INFO] [launch.py:347:main] Process 888726 exits successfully.
[2023-06-29 20:10:24,257] [INFO] [launch.py:347:main] Process 888725 exits successfully.
[2023-06-29 20:10:24,257] [INFO] [launch.py:347:main] Process 888723 exits successfully.
[2023-06-29 20:10:25,258] [INFO] [launch.py:347:main] Process 888728 exits successfully.
[2023-06-29 20:10:28,262] [INFO] [launch.py:347:main] Process 888722 exits successfully.