/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
[2023-04-21 21:43:51,001] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 21:43:51,650] [INFO] [runner.py:540:main] cmd = /home/AdamG012/.conda/envs/py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir ./output
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
[2023-04-21 21:43:55,230] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-21 21:43:55,231] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-21 21:43:55,231] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-21 21:43:55,231] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-21 21:43:55,231] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
[2023-04-21 21:44:13,618] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Found cached dataset parquet (/reward/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
0%| | 0/2 [00:00
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-21 21:47:10,243] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-04-21 21:47:10,244] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-21 21:47:10,244] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-21 21:47:10,244] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-21 21:47:10,244] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Emitting ninja build file /home/AdamG012/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.8648874759674072 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.9071669578552246 seconds
Time to load utils op: 0.9073724746704102 seconds
Time to load utils op: 0.9068405628204346 seconds
Time to load utils op: 0.9081339836120605 seconds
Time to load utils op: 0.907646894454956 seconds
Time to load utils op: 0.9078972339630127 seconds
Loading extension module utils...
Time to load utils op: 0.905179500579834 seconds
Rank: 6 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 0 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 2 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 4 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 7 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 3 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 5 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Rank: 1 partition count [8, 8] and sizes[(164401920, False), (67840, False)]
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0011208057403564453 seconds
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0008709430694580078 seconds
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007677078247070312 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0010838508605957031 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.0012142658233642578 seconds
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0010640621185302734 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.001260995864868164 seconds
[2023-04-21 21:47:22,799] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-04-21 21:47:22,799] [INFO] [utils.py:786:see_memory_usage] MA 3.06 GB Max_MA 3.06 GB CA 3.07 GB Max_CA 3 GB
[2023-04-21 21:47:22,800] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 92.22 GB, percent = 9.2%
[2023-04-21 21:47:23,117] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-04-21 21:47:23,118] [INFO] [utils.py:786:see_memory_usage] MA 4.29 GB Max_MA 4.91 GB CA 4.91 GB Max_CA 5 GB
[2023-04-21 21:47:23,118] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 92.22 GB, percent = 9.2%
[2023-04-21 21:47:23,118] [INFO] [stage_1_and_2.py:489:__init__] optimizer state initialized
[2023-04-21 21:47:23,413] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-04-21 21:47:23,413] [INFO] [utils.py:786:see_memory_usage] MA 4.29 GB Max_MA 4.29 GB CA 4.91 GB Max_CA 5 GB
[2023-04-21 21:47:23,413] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 92.22 GB, percent = 9.2%
[2023-04-21 21:47:23,415] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-04-21 21:47:23,415] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-21 21:47:23,415] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2023-04-21 21:47:23,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:23,416] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-21 21:47:23,416] [INFO] [config.py:957:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-04-21 21:47:23,416] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-21 21:47:23,416] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-04-21 21:47:23,416] [INFO] [config.py:957:print] amp_params ................... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] comms_config .................
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] dump_state ................... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_enabled ........... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] global_rank .................. 0
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] gradient_accumulation_steps .. 1
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] gradient_clipping ............ 1.0
[2023-04-21 21:47:23,417] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] pld_params ................... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] scheduler_params ............. None
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] sparse_attention ............. None
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] steps_per_print .............. 10
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] train_batch_size ............. 64
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 8
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] world_size ................... 8
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-04-21 21:47:23,418] [INFO] [config.py:957:print] zero_optimization_stage ...... 2
[2023-04-21 21:47:23,418] [INFO] [config.py:943:print_user_config] json = { "train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "steps_per_print": 10, "zero_optimization": { "stage": 2, "offload_param": { "device": "none" }, "offload_optimizer": { "device": "none" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07, "memory_efficient_linear": false }, "fp16": { "enabled": true, "loss_scale_window": 100 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "hybrid_engine": { "enabled": false, "max_out_tokens": 512, "inference_tp_size": 1, "release_inference_cache": false, "pin_parameters": true, "tp_gather_partition_size": 8 } }
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0008244514465332031 seconds
***** Running training *****
***** Evaluating perplexity, Epoch 0/16 *****
ppl: 4937.50439453125
Beginning of Epoch 1/16, Total Micro Batches 920
[2023-04-21 21:47:33,149] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-21 21:47:33,487] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-21 21:47:33,821] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-04-21 21:47:34,156] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2023-04-21 21:47:34,491] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step.
Attempted loss scale: 8192, reducing to 4096
[2023-04-21 21:47:35,569] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2023-04-21 21:47:36,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=6, lr=[9.649998241787337e-06, 9.649998241787337e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:36,294] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=182.5958316986737, CurrSamplesPerSec=176.91490093678698, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:39,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=6, lr=[9.649978461909591e-06, 9.649978461909591e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:39,919] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=179.31820386765057, CurrSamplesPerSec=177.04488589895792, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:43,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=6, lr=[9.649936704478667e-06, 9.649936704478667e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:43,566] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=178.00196469721402, CurrSamplesPerSec=176.94428836791113, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:47,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=6, lr=[9.649872969684765e-06, 9.649872969684765e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:47,191] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=177.6800031869205, CurrSamplesPerSec=177.15319681667924, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:50,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=6, lr=[9.649787257818198e-06, 9.649787257818198e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:50,811] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=177.54238690381163, CurrSamplesPerSec=177.02340304302868, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:54,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=6, lr=[9.649679569269376e-06, 9.649679569269376e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:54,438] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=177.38516479920085, CurrSamplesPerSec=176.04368244707746, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:47:58,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=6, lr=[9.649549904528819e-06, 9.649549904528819e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:47:58,060] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=177.3183044173705, CurrSamplesPerSec=176.94988708080479, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:01,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=6, lr=[9.649398264187143e-06, 9.649398264187143e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:01,680] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=177.2722089262673, CurrSamplesPerSec=177.17611448131018, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:05,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=6, lr=[9.64922464893506e-06, 9.64922464893506e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:05,307] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=177.2087898449562, CurrSamplesPerSec=176.97951950209657, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:08,911] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=6, lr=[9.649029059563382e-06, 9.649029059563382e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:08,929] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=177.1746551050676, CurrSamplesPerSec=176.8711877070994, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:12,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=6, lr=[9.648811496963009e-06, 9.648811496963009e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:12,551] [INFO] [timer.py:199:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=177.15147671534058, CurrSamplesPerSec=176.85638838963354, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:16,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=6, lr=[9.64857196212493e-06, 9.64857196212493e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:16,197] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=177.02908987236867, CurrSamplesPerSec=176.88598950158215, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:19,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=6, lr=[9.648310456140211e-06, 9.648310456140211e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:19,823] [INFO] [timer.py:199:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=177.00800831849185, CurrSamplesPerSec=176.69201462848235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:23,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=6, lr=[9.648026980200002e-06, 9.648026980200002e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:23,444] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=177.00351794326136, CurrSamplesPerSec=177.02375326517543, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:27,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=6, lr=[9.647721535595524e-06, 9.647721535595524e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:27,064] [INFO] [timer.py:199:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=177.0034166469544, CurrSamplesPerSec=176.96738534239063, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:30,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=6, lr=[9.647394123718063e-06, 9.647394123718063e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:30,682] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=177.00903183065697, CurrSamplesPerSec=177.0368292169793, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:34,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=6, lr=[9.647044746058962e-06, 9.647044746058962e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:34,302] [INFO] [timer.py:199:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=177.01166862268389, CurrSamplesPerSec=177.02235238490286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:37,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=6, lr=[9.646673404209623e-06, 9.646673404209623e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:37,921] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=177.01221174849942, CurrSamplesPerSec=176.66933829486587, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:41,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=6, lr=[9.64628009986149e-06, 9.64628009986149e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:41,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=176.97689728542647, CurrSamplesPerSec=176.94417173182657, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:45,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=6, lr=[9.645864834806044e-06, 9.645864834806044e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:45,175] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=176.9752111899041, CurrSamplesPerSec=176.5318932057351, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:48,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=6, lr=[9.6454276109348e-06, 9.6454276109348e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:48,824] [INFO] [timer.py:199:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=176.9100140702119, CurrSamplesPerSec=177.09557807735496, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:52,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=6, lr=[9.644968430239294e-06, 9.644968430239294e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:52,444] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=176.91355847206793, CurrSamplesPerSec=177.30718639961373, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:56,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=6, lr=[9.644487294811071e-06, 9.644487294811071e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:56,063] [INFO] [timer.py:199:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=176.92030140329123, CurrSamplesPerSec=176.9692520278894, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:48:59,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=6, lr=[9.643984206841679e-06, 9.643984206841679e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:48:59,683] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=176.923832848225, CurrSamplesPerSec=176.8380965932859, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:03,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=6, lr=[9.643459168622665e-06, 9.643459168622665e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:03,302] [INFO] [timer.py:199:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=176.92950267791142, CurrSamplesPerSec=177.0381135664247, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:06,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=6, lr=[9.64291218254555e-06, 9.64291218254555e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:06,921] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=176.93316884851214, CurrSamplesPerSec=177.04173319694215, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:10,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=6, lr=[9.64234325110183e-06, 9.64234325110183e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:10,539] [INFO] [timer.py:199:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=176.94012786236848, CurrSamplesPerSec=177.17716696544315, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:14,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=6, lr=[9.641752376882963e-06, 9.641752376882963e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:14,158] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=176.9445715676773, CurrSamplesPerSec=176.71225381059529, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:17,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=6, lr=[9.64113956258035e-06, 9.64113956258035e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:17,780] [INFO] [timer.py:199:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=176.94319936545318, CurrSamplesPerSec=176.9931724209323, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:21,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=6, lr=[9.640504810985339e-06, 9.640504810985339e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:21,410] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=176.92973705083983, CurrSamplesPerSec=176.79698299311215, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:25,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=6, lr=[9.639848124989188e-06, 9.639848124989188e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:25,032] [INFO] [timer.py:199:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=176.92920344820175, CurrSamplesPerSec=176.79907897475348, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:28,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=6, lr=[9.639169507583073e-06, 9.639169507583073e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:28,651] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=176.93301095675025, CurrSamplesPerSec=177.06917714605353, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:32,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=6, lr=[9.638468961858065e-06, 9.638468961858065e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:32,269] [INFO] [timer.py:199:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=176.9378163644765, CurrSamplesPerSec=176.96481871412882, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:35,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=6, lr=[9.637746491005118e-06, 9.637746491005118e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:35,889] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=176.9399877727932, CurrSamplesPerSec=176.57288359240206, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:39,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=6, lr=[9.637002098315053e-06, 9.637002098315053e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:49:39,508] [INFO] [timer.py:199:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=176.94295443566105, CurrSamplesPerSec=176.94195567543483, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:49:43,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=6, lr=[9.636235787178543e-06,
9.636235787178543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:49:43,128] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=176.94334160011334, CurrSamplesPerSec=177.13554490360454, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:49:46,735] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=6, lr=[9.635447561086101e-06, 9.635447561086101e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:49:46,753] [INFO] [timer.py:199:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=176.93888111589976, CurrSamplesPerSec=176.9288937340907, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:49:50,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=6, lr=[9.634637423628059e-06, 9.634637423628059e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:49:50,373] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=176.94104082011054, CurrSamplesPerSec=176.73133406632476, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:49:53,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=6, lr=[9.633805378494556e-06, 9.633805378494556e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:49:54,001] [INFO] [timer.py:199:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=176.93217166246248, CurrSamplesPerSec=176.9091878476014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:49:57,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=6, lr=[9.632951429475518e-06, 9.632951429475518e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:49:57,624] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=176.93050908369997, CurrSamplesPerSec=176.93274214140754, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:01,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=6, lr=[9.632075580460647e-06, 9.632075580460647e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-21 21:50:01,247] [INFO] [timer.py:199:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=176.92921642907595, CurrSamplesPerSec=176.75332800860735, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:04,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=6, lr=[9.631177835439391e-06, 9.631177835439391e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:04,865] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=176.93366925483977, CurrSamplesPerSec=177.34174207980382, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:08,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=6, lr=[9.630258198500938e-06, 9.630258198500938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:08,484] [INFO] [timer.py:199:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=176.9364002453582, CurrSamplesPerSec=177.24291353088068, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:12,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=6, lr=[9.629316673834193e-06, 9.629316673834193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:12,103] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=176.93921297642407, CurrSamplesPerSec=177.04535297547284, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:15,701] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=6, lr=[9.628353265727755e-06, 9.628353265727755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:15,720] [INFO] [timer.py:199:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=176.94366119041777, CurrSamplesPerSec=176.96131888647756, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:19,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=6, lr=[9.627367978569902e-06, 9.627367978569902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:19,340] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=176.94471442198966, CurrSamplesPerSec=177.01546504604167, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:22,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=6, lr=[9.626360816848576e-06, 9.626360816848576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:22,959] [INFO] [timer.py:199:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=176.94676567491223, CurrSamplesPerSec=176.88307556667263, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:26,572] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=6, lr=[9.625331785151348e-06, 9.625331785151348e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:26,590] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=176.93627402096547, CurrSamplesPerSec=176.8791127692619, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:30,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=6, lr=[9.624280888165412e-06, 9.624280888165412e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:30,209] [INFO] [timer.py:199:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=176.93831048858283, CurrSamplesPerSec=176.84124204681854, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:33,812] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=6, lr=[9.623208130677554e-06, 9.623208130677554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:33,830] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=176.93894364212719, CurrSamplesPerSec=177.1399867756466, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:37,071] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. 
Reducing hysteresis to 1 [2023-04-21 21:50:37,404] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:50:37,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=8, lr=[9.622334188406173e-06, 9.622334188406173e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:37,405] [INFO] [timer.py:199:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=176.9830432493414, CurrSamplesPerSec=192.2671099781615, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:41,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=8, lr=[9.621222094395383e-06, 9.621222094395383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:41,027] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=176.9820478116591, CurrSamplesPerSec=176.89379932059083, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:44,628] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=8, lr=[9.620088153815335e-06, 9.620088153815335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:44,646] [INFO] [timer.py:199:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=176.98310469219305, CurrSamplesPerSec=176.92632822221753, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:48,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=8, lr=[9.618932371831077e-06, 9.618932371831077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:48,265] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=176.98453959941804, CurrSamplesPerSec=177.0638044692811, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:51,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=8, lr=[9.61775475370714e-06, 9.61775475370714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:51,906] [INFO] [timer.py:199:stop] epoch=0/micro_step=550/global_step=550, 
RunningAvgSamplesPerSec=176.966507792401, CurrSamplesPerSec=177.15869183870922, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:55,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=8, lr=[9.61655530480752e-06, 9.61655530480752e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:55,525] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=176.9676427201855, CurrSamplesPerSec=177.190031670845, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:50:59,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=8, lr=[9.615334030595654e-06, 9.615334030595654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:50:59,167] [INFO] [timer.py:199:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=176.952878175726, CurrSamplesPerSec=177.19342358618786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:02,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=8, lr=[9.614090936634385e-06, 9.614090936634385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:02,786] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=176.95444941394672, CurrSamplesPerSec=177.0473380781557, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:06,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=8, lr=[9.612826028585952e-06, 9.612826028585952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:06,405] [INFO] [timer.py:199:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=176.95559863290003, CurrSamplesPerSec=177.12876563364, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:10,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=8, lr=[9.611539312211953e-06, 9.611539312211953e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:10,026] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=176.9563126568298, 
CurrSamplesPerSec=176.89088512836733, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:13,628] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=8, lr=[9.610230793373317e-06, 9.610230793373317e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:13,646] [INFO] [timer.py:199:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=176.9565104948679, CurrSamplesPerSec=177.17903807922104, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:13,979] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:51:14,313] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:51:17,189] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=10, lr=[9.60916828452982e-06, 9.60916828452982e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:17,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=177.00368913090927, CurrSamplesPerSec=177.1535475525814, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:20,808] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=10, lr=[9.607820536341373e-06, 9.607820536341373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:20,826] [INFO] [timer.py:199:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=177.004496635929, CurrSamplesPerSec=177.018850281221, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:24,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=10, lr=[9.606451002627145e-06, 9.606451002627145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:24,448] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=177.00349188126967, CurrSamplesPerSec=176.95291986489016, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
21:51:28,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=10, lr=[9.605059689625296e-06, 9.605059689625296e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:28,066] [INFO] [timer.py:199:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=177.00477372680578, CurrSamplesPerSec=176.95746923594663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:31,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=10, lr=[9.603646603673193e-06, 9.603646603673193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:31,695] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=176.99988544394958, CurrSamplesPerSec=177.04687099116663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:35,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=10, lr=[9.60221175120738e-06, 9.60221175120738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:35,314] [INFO] [timer.py:199:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=177.0001586611037, CurrSamplesPerSec=177.1813770270199, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:38,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=10, lr=[9.600755138763538e-06, 9.600755138763538e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:38,933] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=177.00110910720164, CurrSamplesPerSec=176.95711927602193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:42,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=10, lr=[9.599276772976471e-06, 9.599276772976471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:42,552] [INFO] [timer.py:199:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=177.0026192293441, CurrSamplesPerSec=177.08845138660283, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:46,154] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=700, skipped=10, lr=[9.59777666058007e-06, 9.59777666058007e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:46,172] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=177.00281997477563, CurrSamplesPerSec=177.08810090841322, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:49,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=10, lr=[9.596254808407273e-06, 9.596254808407273e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:49,795] [INFO] [timer.py:199:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=177.00088637114132, CurrSamplesPerSec=176.7414575820564, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:50,854] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:51:51,189] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:51:53,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=12, lr=[9.595021678684986e-06, 9.595021678684986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:53,362] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=177.03667456942364, CurrSamplesPerSec=177.14279228329602, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:51:56,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=12, lr=[9.593460712449759e-06, 9.593460712449759e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:51:56,992] [INFO] [timer.py:199:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=177.02996063744445, CurrSamplesPerSec=176.95560279900985, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:00,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=12, lr=[9.59187802609708e-06, 9.59187802609708e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:00,612] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=177.02985231973727, CurrSamplesPerSec=176.86932308979678, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:04,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=12, lr=[9.590273626836016e-06, 9.590273626836016e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:04,233] [INFO] [timer.py:199:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=177.02908979829672, CurrSamplesPerSec=177.11894830050292, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:07,852] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=12, lr=[9.588647521974525e-06, 9.588647521974525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:07,871] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=177.01731488191984, CurrSamplesPerSec=176.9332086263979, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 21:52:11,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=12, lr=[9.586999718919445e-06, 9.586999718919445e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:11,495] [INFO] [timer.py:199:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=177.01474111011888, CurrSamplesPerSec=176.78731883175087, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:15,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=12, lr=[9.585330225176441e-06, 9.585330225176441e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:15,116] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=177.01422213517228, CurrSamplesPerSec=177.07980666283615, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:18,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=12, lr=[9.583639048349978e-06, 9.583639048349978e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:18,734] [INFO] [timer.py:199:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=177.01548060016944, CurrSamplesPerSec=177.15296299351604, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:22,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=12, lr=[9.58192619614329e-06, 9.58192619614329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:22,353] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=177.01609258180682, CurrSamplesPerSec=176.9510535239195, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:25,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=12, lr=[9.580191676358337e-06, 9.580191676358337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:25,973] [INFO] [timer.py:199:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=177.01631654359025, CurrSamplesPerSec=177.04220025682238, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:27,753] [INFO] 
[loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:52:28,086] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:52:29,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=14, lr=[9.578788465179952e-06, 9.578788465179952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:29,535] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=177.0506173209007, CurrSamplesPerSec=176.94312201398483, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:33,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=14, lr=[9.57701496373008e-06, 9.57701496373008e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:33,156] [INFO] [timer.py:199:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=177.04973645070874, CurrSamplesPerSec=176.92446244213147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:36,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=14, lr=[9.575219817072382e-06, 9.575219817072382e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:36,777] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=177.04892635957177, CurrSamplesPerSec=177.15869183870922, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:40,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=14, lr=[9.573403033383666e-06, 9.573403033383666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:40,407] [INFO] [timer.py:199:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=177.04304212746516, CurrSamplesPerSec=176.86279723882745, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:44,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=14, lr=[9.571564620939298e-06, 
9.571564620939298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:44,028] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=177.04186996684422, CurrSamplesPerSec=176.98278667556298, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:47,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=14, lr=[9.56970458811316e-06, 9.56970458811316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:47,649] [INFO] [timer.py:199:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=177.04105695994454, CurrSamplesPerSec=176.8742177940689, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:51,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=14, lr=[9.567822943377617e-06, 9.567822943377617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:51,267] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=177.04209346586364, CurrSamplesPerSec=177.035311373478, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:54,868] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=14, lr=[9.565919695303474e-06, 9.565919695303474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:54,886] [INFO] [timer.py:199:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=177.0420669235985, CurrSamplesPerSec=177.0402152693505, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:52:58,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=14, lr=[9.563994852559934e-06, 9.563994852559934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:52:58,513] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=177.0383183515423, CurrSamplesPerSec=176.87584942236313, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:02,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=14, lr=[9.562048423914571e-06, 9.562048423914571e-06], mom=[(0.9, 0.95), (0.9, 
0.95)]
[2023-04-21 21:53:02,140] [INFO] [timer.py:199:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=177.03441346119916, CurrSamplesPerSec=177.1816109251962, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:53:04,645] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-21 21:53:04,978] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-21 21:53:05,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=16, lr=[9.560475745103543e-06, 9.560475745103543e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:53:05,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=177.0650225426097, CurrSamplesPerSec=177.1267787045719, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 1/16 *****
ppl: 2.0015385150909424
Beginning of Epoch 2/16, Total Micro Batches 920
[2023-04-21 21:53:17,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=16, lr=[9.55849048424299e-06, 9.55849048424299e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:53:17,484] [INFO] [timer.py:199:stop] epoch=1/micro_step=10/global_step=930, RunningAvgSamplesPerSec=177.0515808558939, CurrSamplesPerSec=177.16453798106622, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:53:21,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=16, lr=[9.556483662552754e-06, 9.556483662552754e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 21:53:21,104] [INFO] [timer.py:199:stop] epoch=1/micro_step=20/global_step=940, RunningAvgSamplesPerSec=177.05155342563174, CurrSamplesPerSec=177.018850281221, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 21:53:24,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=16, lr=[9.554455289173818e-06, 9.554455289173818e-06], 
mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:24,724] [INFO] [timer.py:199:stop] epoch=1/micro_step=30/global_step=950, RunningAvgSamplesPerSec=177.05126333217828, CurrSamplesPerSec=176.95256992295933, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:28,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=16, lr=[9.552405373345324e-06, 9.552405373345324e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:28,343] [INFO] [timer.py:199:stop] epoch=1/micro_step=40/global_step=960, RunningAvgSamplesPerSec=177.0515804915154, CurrSamplesPerSec=177.05761461548153, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:31,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=16, lr=[9.550333924404544e-06, 9.550333924404544e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:31,961] [INFO] [timer.py:199:stop] epoch=1/micro_step=50/global_step=970, RunningAvgSamplesPerSec=177.0521193473982, CurrSamplesPerSec=177.00951201547244, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:35,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=16, lr=[9.548240951786835e-06, 9.548240951786835e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:35,583] [INFO] [timer.py:199:stop] epoch=1/micro_step=60/global_step=980, RunningAvgSamplesPerSec=177.05062208447754, CurrSamplesPerSec=176.9348413432353, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:39,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=16, lr=[9.546126465025589e-06, 9.546126465025589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:39,205] [INFO] [timer.py:199:stop] epoch=1/micro_step=70/global_step=990, RunningAvgSamplesPerSec=177.04941789799003, CurrSamplesPerSec=176.91000398060837, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:42,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=16, lr=[9.543990473752193e-06, 9.543990473752193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 
21:53:42,827] [INFO] [timer.py:199:stop] epoch=1/micro_step=80/global_step=1000, RunningAvgSamplesPerSec=177.04825318733924, CurrSamplesPerSec=176.9775359199144, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:46,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=16, lr=[9.54183298769599e-06, 9.54183298769599e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:46,457] [INFO] [timer.py:199:stop] epoch=1/micro_step=90/global_step=1010, RunningAvgSamplesPerSec=177.0428475651157, CurrSamplesPerSec=176.76578203259066, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:49,689] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:53:50,023] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:53:50,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=18, lr=[9.540091529208031e-06, 9.540091529208031e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:50,024] [INFO] [timer.py:199:stop] epoch=1/micro_step=100/global_step=1020, RunningAvgSamplesPerSec=177.06810953751008, CurrSamplesPerSec=192.03094981242972, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:53,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=18, lr=[9.53789537737321e-06, 9.53789537737321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:53,645] [INFO] [timer.py:199:stop] epoch=1/micro_step=110/global_step=1030, RunningAvgSamplesPerSec=177.06694307364648, CurrSamplesPerSec=176.88622262052243, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:53:57,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=18, lr=[9.535677758518463e-06, 9.535677758518463e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:53:57,264] [INFO] [timer.py:199:stop] epoch=1/micro_step=120/global_step=1040, 
RunningAvgSamplesPerSec=177.06655062837228, CurrSamplesPerSec=177.03636218543758, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:00,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=18, lr=[9.53343868274494e-06, 9.53343868274494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:00,885] [INFO] [timer.py:199:stop] epoch=1/micro_step=130/global_step=1050, RunningAvgSamplesPerSec=177.06568177084293, CurrSamplesPerSec=176.4381397831228, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:04,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=18, lr=[9.531178160251531e-06, 9.531178160251531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:04,508] [INFO] [timer.py:199:stop] epoch=1/micro_step=140/global_step=1060, RunningAvgSamplesPerSec=177.0634902967856, CurrSamplesPerSec=176.30707532453886, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:08,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=18, lr=[9.528896201334807e-06, 9.528896201334807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:08,128] [INFO] [timer.py:199:stop] epoch=1/micro_step=150/global_step=1070, RunningAvgSamplesPerSec=177.06289265139742, CurrSamplesPerSec=176.96458538797856, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:11,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=18, lr=[9.526592816388989e-06, 9.526592816388989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:11,748] [INFO] [timer.py:199:stop] epoch=1/micro_step=160/global_step=1080, RunningAvgSamplesPerSec=177.06230826479646, CurrSamplesPerSec=176.98231992911093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:15,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=18, lr=[9.524268015905887e-06, 9.524268015905887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:15,368] [INFO] [timer.py:199:stop] epoch=1/micro_step=170/global_step=1090, 
RunningAvgSamplesPerSec=177.0620227380239, CurrSamplesPerSec=176.740643003404, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:18,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=18, lr=[9.521921810474856e-06, 9.521921810474856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:19,017] [INFO] [timer.py:199:stop] epoch=1/micro_step=180/global_step=1100, RunningAvgSamplesPerSec=177.04869096400887, CurrSamplesPerSec=177.12373995810032, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:22,620] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=18, lr=[9.519554210782758e-06, 9.519554210782758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:22,639] [INFO] [timer.py:199:stop] epoch=1/micro_step=190/global_step=1110, RunningAvgSamplesPerSec=177.04780305979938, CurrSamplesPerSec=176.89601417089958, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:26,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=18, lr=[9.517165227613896e-06, 9.517165227613896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:26,267] [INFO] [timer.py:199:stop] epoch=1/micro_step=200/global_step=1120, RunningAvgSamplesPerSec=177.0435574808127, CurrSamplesPerSec=176.84450411682317, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:26,600] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:54:26,933] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:54:29,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=20, lr=[9.515238652284776e-06, 9.515238652284776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:29,831] [INFO] [timer.py:199:stop] epoch=1/micro_step=210/global_step=1130, RunningAvgSamplesPerSec=177.0674023705547, CurrSamplesPerSec=176.64027458538777, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:33,433] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=20, lr=[9.512811206345068e-06, 9.512811206345068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:33,452] [INFO] [timer.py:199:stop] epoch=1/micro_step=220/global_step=1140, RunningAvgSamplesPerSec=177.06682165906216, CurrSamplesPerSec=177.18184482399005, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:37,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=20, lr=[9.51036240764267e-06, 9.51036240764267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:37,069] [INFO] [timer.py:199:stop] epoch=1/micro_step=230/global_step=1150, RunningAvgSamplesPerSec=177.06731944167902, CurrSamplesPerSec=176.85359194645022, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:40,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=20, lr=[9.507892267331749e-06, 9.507892267331749e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:40,687] [INFO] [timer.py:199:stop] epoch=1/micro_step=240/global_step=1160, RunningAvgSamplesPerSec=177.0677733277868, CurrSamplesPerSec=177.25403209564644, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:44,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=20, lr=[9.505400796663676e-06, 9.505400796663676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:44,305] [INFO] [timer.py:199:stop] epoch=1/micro_step=250/global_step=1170, RunningAvgSamplesPerSec=177.0677321471426, CurrSamplesPerSec=176.90487412679585, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 21:54:47,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=20, lr=[9.502888006986986e-06, 9.502888006986986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:47,927] [INFO] [timer.py:199:stop] epoch=1/micro_step=260/global_step=1180, RunningAvgSamplesPerSec=177.06652249454305, CurrSamplesPerSec=176.87468397050984, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:51,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=20, lr=[9.500353909747319e-06, 9.500353909747319e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:51,567] [INFO] [timer.py:199:stop] epoch=1/micro_step=270/global_step=1190, RunningAvgSamplesPerSec=177.05749311134278, CurrSamplesPerSec=177.0149981272124, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:55,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=20, lr=[9.497798516487371e-06, 9.497798516487371e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:55,186] [INFO] [timer.py:199:stop] epoch=1/micro_step=280/global_step=1200, RunningAvgSamplesPerSec=177.05748827654614, CurrSamplesPerSec=177.20734349522382, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:54:58,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=20, lr=[9.49522183884684e-06, 9.49522183884684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:54:58,805] [INFO] [timer.py:199:stop] epoch=1/micro_step=290/global_step=1210, RunningAvgSamplesPerSec=177.05753050667872, CurrSamplesPerSec=177.04488589895792, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:02,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=20, lr=[9.492623888562372e-06, 9.492623888562372e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:02,423] [INFO] [timer.py:199:stop] epoch=1/micro_step=300/global_step=1220, RunningAvgSamplesPerSec=177.05775249548168, CurrSamplesPerSec=177.18289737620452, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
21:55:03,479] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:55:03,813] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:55:05,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=22, lr=[9.490530219980049e-06, 9.490530219980049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:05,992] [INFO] [timer.py:199:stop] epoch=1/micro_step=310/global_step=1230, RunningAvgSamplesPerSec=177.0789948601567, CurrSamplesPerSec=176.9003274596211, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:09,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=22, lr=[9.487894008822105e-06, 9.487894008822105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:09,629] [INFO] [timer.py:199:stop] epoch=1/micro_step=320/global_step=1240, RunningAvgSamplesPerSec=177.0716796258037, CurrSamplesPerSec=177.1095994711166, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:13,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=22, lr=[9.485236558398151e-06, 9.485236558398151e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:13,247] [INFO] [timer.py:199:stop] epoch=1/micro_step=330/global_step=1250, RunningAvgSamplesPerSec=177.0717209706708, CurrSamplesPerSec=177.2258287712293, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:16,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=22, lr=[9.482557880812749e-06, 9.482557880812749e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:16,865] [INFO] [timer.py:199:stop] epoch=1/micro_step=340/global_step=1260, RunningAvgSamplesPerSec=177.07195404196622, CurrSamplesPerSec=177.17096918331936, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:20,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=22, 
lr=[9.479857988267154e-06, 9.479857988267154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:20,484] [INFO] [timer.py:199:stop] epoch=1/micro_step=350/global_step=1270, RunningAvgSamplesPerSec=177.07183532678224, CurrSamplesPerSec=177.00986218265743, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:24,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=22, lr=[9.477136893059248e-06, 9.477136893059248e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:24,131] [INFO] [timer.py:199:stop] epoch=1/micro_step=360/global_step=1280, RunningAvgSamplesPerSec=177.06090779338572, CurrSamplesPerSec=175.81722663344658, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:27,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=22, lr=[9.474394607583496e-06, 9.474394607583496e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:27,750] [INFO] [timer.py:199:stop] epoch=1/micro_step=370/global_step=1290, RunningAvgSamplesPerSec=177.06114962693684, CurrSamplesPerSec=177.0862317148312, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:31,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=22, lr=[9.47163114433088e-06, 9.47163114433088e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:31,368] [INFO] [timer.py:199:stop] epoch=1/micro_step=380/global_step=1300, RunningAvgSamplesPerSec=177.06135255375102, CurrSamplesPerSec=177.0441852888064, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:34,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=22, lr=[9.468846515888848e-06, 9.468846515888848e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:34,986] [INFO] [timer.py:199:stop] epoch=1/micro_step=390/global_step=1310, RunningAvgSamplesPerSec=177.0613498695657, CurrSamplesPerSec=177.0773535841554, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:38,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=22, lr=[9.466040734941254e-06, 
9.466040734941254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:38,607] [INFO] [timer.py:199:stop] epoch=1/micro_step=400/global_step=1320, RunningAvgSamplesPerSec=177.060538114194, CurrSamplesPerSec=176.77439612254037, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:40,387] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:55:40,721] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:55:42,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=24, lr=[9.463780888964232e-06, 9.463780888964232e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:42,169] [INFO] [timer.py:199:stop] epoch=1/micro_step=410/global_step=1330, RunningAvgSamplesPerSec=177.0813343297598, CurrSamplesPerSec=177.17576365604455, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:45,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=24, lr=[9.460937065777442e-06, 9.460937065777442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:45,789] [INFO] [timer.py:199:stop] epoch=1/micro_step=420/global_step=1340, RunningAvgSamplesPerSec=177.08089746568393, CurrSamplesPerSec=177.05574606541623, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:49,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=24, lr=[9.458072126112267e-06, 9.458072126112267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:49,406] [INFO] [timer.py:199:stop] epoch=1/micro_step=430/global_step=1350, RunningAvgSamplesPerSec=177.0812704140199, CurrSamplesPerSec=177.2964124674961, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:53,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=24, lr=[9.455186083018376e-06, 9.455186083018376e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:53,025] [INFO] 
[timer.py:199:stop] epoch=1/micro_step=440/global_step=1360, RunningAvgSamplesPerSec=177.08105223830512, CurrSamplesPerSec=176.75193140507784, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:55:56,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=24, lr=[9.45227894964156e-06, 9.45227894964156e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:55:56,660] [INFO] [timer.py:199:stop] epoch=1/micro_step=450/global_step=1370, RunningAvgSamplesPerSec=177.0763120197155, CurrSamplesPerSec=174.46439615605092, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:00,264] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=24, lr=[9.449350739223678e-06, 9.449350739223678e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:00,282] [INFO] [timer.py:199:stop] epoch=1/micro_step=460/global_step=1380, RunningAvgSamplesPerSec=177.07488209431506, CurrSamplesPerSec=176.98290336256068, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=24, lr=[9.446401465102589e-06, 9.446401465102589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:03,900] [INFO] [timer.py:199:stop] epoch=1/micro_step=470/global_step=1390, RunningAvgSamplesPerSec=177.07513503214005, CurrSamplesPerSec=177.18371603657266, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:07,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=24, lr=[9.443431140712103e-06, 9.443431140712103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:07,519] [INFO] [timer.py:199:stop] epoch=1/micro_step=480/global_step=1400, RunningAvgSamplesPerSec=177.0749583927877, CurrSamplesPerSec=177.00717760297735, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:11,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=24, lr=[9.440439779581911e-06, 9.440439779581911e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:11,138] [INFO] [timer.py:199:stop] 
epoch=1/micro_step=490/global_step=1410, RunningAvgSamplesPerSec=177.0747726484048, CurrSamplesPerSec=176.99340582247217, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:14,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=24, lr=[9.437427395337521e-06, 9.437427395337521e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:14,762] [INFO] [timer.py:199:stop] epoch=1/micro_step=500/global_step=1420, RunningAvgSamplesPerSec=177.0731053830277, CurrSamplesPerSec=176.25093136707747, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:17,266] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:56:17,599] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:56:18,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=26, lr=[9.435002360517267e-06, 9.435002360517267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:18,323] [INFO] [timer.py:199:stop] epoch=1/micro_step=510/global_step=1430, RunningAvgSamplesPerSec=177.09250473002857, CurrSamplesPerSec=177.19038255261708, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:21,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=26, lr=[9.431952169309237e-06, 9.431952169309237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:21,943] [INFO] [timer.py:199:stop] epoch=1/micro_step=520/global_step=1440, RunningAvgSamplesPerSec=177.09206836355582, CurrSamplesPerSec=177.20827936588412, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:25,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=26, lr=[9.428880993647682e-06, 9.428880993647682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:25,563] [INFO] [timer.py:199:stop] epoch=1/micro_step=530/global_step=1450, RunningAvgSamplesPerSec=177.09132005109518, 
CurrSamplesPerSec=176.97555238219562, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:29,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=26, lr=[9.425788847521664e-06, 9.425788847521664e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:29,197] [INFO] [timer.py:199:stop] epoch=1/micro_step=540/global_step=1460, RunningAvgSamplesPerSec=177.0861678546193, CurrSamplesPerSec=176.89648046224247, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:32,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=26, lr=[9.422675745015768e-06, 9.422675745015768e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:32,817] [INFO] [timer.py:199:stop] epoch=1/micro_step=550/global_step=1470, RunningAvgSamplesPerSec=177.0855057895082, CurrSamplesPerSec=177.01243011768057, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:36,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=26, lr=[9.419541700310026e-06, 9.419541700310026e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:36,435] [INFO] [timer.py:199:stop] epoch=1/micro_step=560/global_step=1480, RunningAvgSamplesPerSec=177.08540281029647, CurrSamplesPerSec=176.97590241503278, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:40,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=26, lr=[9.416386727679873e-06, 9.416386727679873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:40,054] [INFO] [timer.py:199:stop] epoch=1/micro_step=570/global_step=1490, RunningAvgSamplesPerSec=177.08496857119874, CurrSamplesPerSec=177.1562365739711, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:43,658] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=26, lr=[9.413210841496058e-06, 9.413210841496058e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:43,676] [INFO] [timer.py:199:stop] epoch=1/micro_step=580/global_step=1500, RunningAvgSamplesPerSec=177.08386969953926, CurrSamplesPerSec=176.99644009850886, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:47,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=26, lr=[9.410014056224598e-06, 9.410014056224598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:47,294] [INFO] [timer.py:199:stop] epoch=1/micro_step=590/global_step=1510, RunningAvgSamplesPerSec=177.083964069093, CurrSamplesPerSec=177.0548118051731, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:50,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=26, lr=[9.406796386426702e-06, 9.406796386426702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:50,914] [INFO] [timer.py:199:stop] epoch=1/micro_step=600/global_step=1520, RunningAvgSamplesPerSec=177.08328662982905, CurrSamplesPerSec=177.00460982034824, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:54,142] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:56:54,476] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:56:54,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=28, lr=[9.404207223575212e-06, 9.404207223575212e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:54,477] [INFO] [timer.py:199:stop] epoch=1/micro_step=610/global_step=1530, RunningAvgSamplesPerSec=177.1011944740485, CurrSamplesPerSec=191.95391426285204, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:56:58,077] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=28, lr=[9.40095199862758e-06, 9.40095199862758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:56:58,095] [INFO] [timer.py:199:stop] epoch=1/micro_step=620/global_step=1540, RunningAvgSamplesPerSec=177.10118750800265, CurrSamplesPerSec=177.00542683401272, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:01,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=28, lr=[9.397675930430762e-06, 9.397675930430762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:01,716] [INFO] [timer.py:199:stop] epoch=1/micro_step=630/global_step=1550, RunningAvgSamplesPerSec=177.10006949737243, CurrSamplesPerSec=177.01161303936405, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:05,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=28, lr=[9.3943790339071e-06, 9.3943790339071e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:05,336] [INFO] [timer.py:199:stop] epoch=1/micro_step=640/global_step=1560, RunningAvgSamplesPerSec=177.099533781486, CurrSamplesPerSec=176.81107360336713, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:08,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=28, lr=[9.391061324073802e-06, 9.391061324073802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:08,956] [INFO] [timer.py:199:stop] epoch=1/micro_step=650/global_step=1570, RunningAvgSamplesPerSec=177.09884052096842, CurrSamplesPerSec=176.631324888962, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 21:57:12,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=28, lr=[9.387722816042882e-06, 9.387722816042882e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:12,579] [INFO] [timer.py:199:stop] epoch=1/micro_step=660/global_step=1580, RunningAvgSamplesPerSec=177.09739593508232, CurrSamplesPerSec=177.00659400947427, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:16,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=28, lr=[9.384363525021092e-06, 9.384363525021092e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:16,199] [INFO] [timer.py:199:stop] epoch=1/micro_step=670/global_step=1590, RunningAvgSamplesPerSec=177.09670796227698, CurrSamplesPerSec=176.90394145806533, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:19,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=28, lr=[9.380983466309844e-06, 9.380983466309844e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:19,823] [INFO] [timer.py:199:stop] epoch=1/micro_step=680/global_step=1600, RunningAvgSamplesPerSec=177.09505151403047, CurrSamplesPerSec=176.9816198140488, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:23,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=28, lr=[9.377582655305148e-06, 9.377582655305148e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:23,443] [INFO] [timer.py:199:stop] epoch=1/micro_step=690/global_step=1610, RunningAvgSamplesPerSec=177.09435821353887, CurrSamplesPerSec=177.10714556693523, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:27,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=28, lr=[9.374161107497545e-06, 9.374161107497545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:27,063] [INFO] [timer.py:199:stop] epoch=1/micro_step=700/global_step=1620, RunningAvgSamplesPerSec=177.0936773874742, CurrSamplesPerSec=176.89694675604363, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
21:57:30,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=28, lr=[9.370718838472023e-06, 9.370718838472023e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:30,685] [INFO] [timer.py:199:stop] epoch=1/micro_step=710/global_step=1630, RunningAvgSamplesPerSec=177.0925163604059, CurrSamplesPerSec=176.91478433944172, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:31,018] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:57:31,351] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:57:34,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=30, lr=[9.367950114508076e-06, 9.367950114508076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:34,249] [INFO] [timer.py:199:stop] epoch=1/micro_step=720/global_step=1640, RunningAvgSamplesPerSec=177.10853594734402, CurrSamplesPerSec=177.15518433848803, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:37,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=30, lr=[9.36447058686571e-06, 9.36447058686571e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:37,887] [INFO] [timer.py:199:stop] epoch=1/micro_step=730/global_step=1650, RunningAvgSamplesPerSec=177.10271767854036, CurrSamplesPerSec=176.91863213298518, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:41,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=30, lr=[9.360970382145298e-06, 9.360970382145298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:41,506] [INFO] [timer.py:199:stop] epoch=1/micro_step=740/global_step=1660, RunningAvgSamplesPerSec=177.1023390915943, CurrSamplesPerSec=177.13671379570204, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:45,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=30, 
lr=[9.357449516290109e-06, 9.357449516290109e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:45,128] [INFO] [timer.py:199:stop] epoch=1/micro_step=750/global_step=1670, RunningAvgSamplesPerSec=177.1011066517827, CurrSamplesPerSec=176.43141378875194, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:48,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=30, lr=[9.353908005337526e-06, 9.353908005337526e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:48,750] [INFO] [timer.py:199:stop] epoch=1/micro_step=760/global_step=1680, RunningAvgSamplesPerSec=177.0998898234157, CurrSamplesPerSec=177.17669919317365, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:52,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=30, lr=[9.350345865418965e-06, 9.350345865418965e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:52,370] [INFO] [timer.py:199:stop] epoch=1/micro_step=770/global_step=1690, RunningAvgSamplesPerSec=177.09937282708282, CurrSamplesPerSec=176.89438217055795, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:55,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=30, lr=[9.346763112759811e-06, 9.346763112759811e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:55,988] [INFO] [timer.py:199:stop] epoch=1/micro_step=780/global_step=1700, RunningAvgSamplesPerSec=177.09951039461455, CurrSamplesPerSec=176.92901035003456, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:57:59,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=30, lr=[9.343159763679335e-06, 9.343159763679335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:57:59,606] [INFO] [timer.py:199:stop] epoch=1/micro_step=790/global_step=1710, RunningAvgSamplesPerSec=177.09943681767749, CurrSamplesPerSec=176.9452214621233, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:03,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=30, lr=[9.339535834590625e-06, 
9.339535834590625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:03,225] [INFO] [timer.py:199:stop] epoch=1/micro_step=800/global_step=1720, RunningAvgSamplesPerSec=177.0991929170509, CurrSamplesPerSec=176.96341876645627, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:06,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=30, lr=[9.335891342000508e-06, 9.335891342000508e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:06,844] [INFO] [timer.py:199:stop] epoch=1/micro_step=810/global_step=1730, RunningAvgSamplesPerSec=177.0989069431106, CurrSamplesPerSec=177.16418720164415, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:07,914] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:58:08,248] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:58:10,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=32, lr=[9.33296095335979e-06, 9.33296095335979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:10,421] [INFO] [timer.py:199:stop] epoch=1/micro_step=820/global_step=1740, RunningAvgSamplesPerSec=177.1104128150397, CurrSamplesPerSec=176.87200348953206, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:14,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=32, lr=[9.329279488363285e-06, 9.329279488363285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:14,039] [INFO] [timer.py:199:stop] epoch=1/micro_step=830/global_step=1750, RunningAvgSamplesPerSec=177.11027726817554, CurrSamplesPerSec=176.94067272078902, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:17,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=32, lr=[9.325577506582558e-06, 9.325577506582558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:17,661] [INFO] 
[timer.py:199:stop] epoch=1/micro_step=840/global_step=1760, RunningAvgSamplesPerSec=177.10897545328038, CurrSamplesPerSec=176.77556025314288, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:21,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=32, lr=[9.321855024879961e-06, 9.321855024879961e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:21,281] [INFO] [timer.py:199:stop] epoch=1/micro_step=850/global_step=1770, RunningAvgSamplesPerSec=177.10833589496264, CurrSamplesPerSec=176.8380965932859, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:24,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=32, lr=[9.318112060211228e-06, 9.318112060211228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:24,907] [INFO] [timer.py:199:stop] epoch=1/micro_step=860/global_step=1780, RunningAvgSamplesPerSec=177.10593697609755, CurrSamplesPerSec=177.06730835193483, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:28,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=32, lr=[9.314348629625388e-06, 9.314348629625388e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:28,525] [INFO] [timer.py:199:stop] epoch=1/micro_step=870/global_step=1790, RunningAvgSamplesPerSec=177.10599954613346, CurrSamplesPerSec=176.93344186981554, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:32,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=32, lr=[9.310564750264693e-06, 9.310564750264693e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:32,146] [INFO] [timer.py:199:stop] epoch=1/micro_step=880/global_step=1800, RunningAvgSamplesPerSec=177.10497156784356, CurrSamplesPerSec=177.14735142276598, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:35,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=32, lr=[9.30676043936454e-06, 9.30676043936454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:35,764] [INFO] [timer.py:199:stop] 
epoch=1/micro_step=890/global_step=1810, RunningAvgSamplesPerSec=177.1049771938188, CurrSamplesPerSec=177.08494666713284, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:39,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=32, lr=[9.302935714253385e-06, 9.302935714253385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:39,383] [INFO] [timer.py:199:stop] epoch=1/micro_step=900/global_step=1820, RunningAvgSamplesPerSec=177.10470355692496, CurrSamplesPerSec=176.94825408627923, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:42,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=32, lr=[9.29909059235268e-06, 9.29909059235268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:43,013] [INFO] [timer.py:199:stop] epoch=1/micro_step=910/global_step=1830, RunningAvgSamplesPerSec=177.1016405108574, CurrSamplesPerSec=177.20863031993022, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:58:44,793] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:58:45,126] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:58:46,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=34, lr=[9.295999820910157e-06, 9.295999820910157e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:46,575] [INFO] [timer.py:199:stop] epoch=1/micro_step=920/global_step=1840, RunningAvgSamplesPerSec=177.11652217682106, CurrSamplesPerSec=176.94813744496645, MemAllocated=4.98GB, MaxMemAllocated=24.08GB ***** Evaluating perplexity, Epoch 2/16 ***** ppl: 1.9456664323806763 Beginning of Epoch 3/16, Total Micro Batches 920 [2023-04-21 21:58:58,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=34, lr=[9.2921180289868e-06, 9.2921180289868e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:58:58,355] [INFO] [timer.py:199:stop] epoch=2/micro_step=10/global_step=1850, RunningAvgSamplesPerSec=177.1090407479315, CurrSamplesPerSec=175.75713624420467, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:01,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=34, lr=[9.288215889547945e-06, 9.288215889547945e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:01,982] [INFO] [timer.py:199:stop] epoch=2/micro_step=20/global_step=1860, RunningAvgSamplesPerSec=177.10663105943513, CurrSamplesPerSec=176.99037165046778, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:05,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=34, lr=[9.284293420367653e-06, 9.284293420367653e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:05,603] [INFO] [timer.py:199:stop] epoch=2/micro_step=30/global_step=1870, RunningAvgSamplesPerSec=177.10570977551907, CurrSamplesPerSec=177.17997365093026, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:09,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=34, lr=[9.280350639312594e-06, 9.280350639312594e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:09,222] [INFO] [timer.py:199:stop] epoch=2/micro_step=40/global_step=1880, 
RunningAvgSamplesPerSec=177.10528692195305, CurrSamplesPerSec=177.08845138660283, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:12,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=34, lr=[9.276387564341946e-06, 9.276387564341946e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:12,843] [INFO] [timer.py:199:stop] epoch=2/micro_step=50/global_step=1890, RunningAvgSamplesPerSec=177.10445042320345, CurrSamplesPerSec=176.97881940919035, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:16,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=34, lr=[9.272404213507338e-06, 9.272404213507338e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:16,489] [INFO] [timer.py:199:stop] epoch=2/micro_step=60/global_step=1900, RunningAvgSamplesPerSec=177.0973847400903, CurrSamplesPerSec=176.8831921222257, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:20,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=34, lr=[9.268400604952746e-06, 9.268400604952746e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:20,111] [INFO] [timer.py:199:stop] epoch=2/micro_step=70/global_step=1910, RunningAvgSamplesPerSec=177.09627143992515, CurrSamplesPerSec=176.87724798486067, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:23,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=34, lr=[9.264376756914422e-06, 9.264376756914422e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:23,731] [INFO] [timer.py:199:stop] epoch=2/micro_step=80/global_step=1920, RunningAvgSamplesPerSec=177.09581388110252, CurrSamplesPerSec=177.004026243777, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:27,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=34, lr=[9.260332687720804e-06, 9.260332687720804e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:27,366] [INFO] [timer.py:199:stop] epoch=2/micro_step=90/global_step=1930, RunningAvgSamplesPerSec=177.09191049411547, 
CurrSamplesPerSec=170.89059290057378, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:29,872] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 21:59:30,206] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 21:59:30,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=36, lr=[9.257082885509618e-06, 9.257082885509618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:30,930] [INFO] [timer.py:199:stop] epoch=2/micro_step=100/global_step=1940, RunningAvgSamplesPerSec=177.10574837087822, CurrSamplesPerSec=177.11333888443892, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:34,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=36, lr=[9.253002464718097e-06, 9.253002464718097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:34,549] [INFO] [timer.py:199:stop] epoch=2/micro_step=110/global_step=1950, RunningAvgSamplesPerSec=177.10554008047671, CurrSamplesPerSec=176.77102023049636, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:38,160] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=36, lr=[9.248901874580661e-06, 9.248901874580661e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:38,179] [INFO] [timer.py:199:stop] epoch=2/micro_step=120/global_step=1960, RunningAvgSamplesPerSec=177.10274078913628, CurrSamplesPerSec=176.57543886422525, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:41,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=36, lr=[9.244781133775306e-06, 9.244781133775306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:41,798] [INFO] [timer.py:199:stop] epoch=2/micro_step=130/global_step=1970, RunningAvgSamplesPerSec=177.10238277147292, CurrSamplesPerSec=176.76310485132197, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 21:59:45,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=36, lr=[9.240640261071813e-06, 9.240640261071813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:45,419] [INFO] [timer.py:199:stop] epoch=2/micro_step=140/global_step=1980, RunningAvgSamplesPerSec=177.10173136224074, CurrSamplesPerSec=176.65410771844896, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:49,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=36, lr=[9.236479275331666e-06, 9.236479275331666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:49,070] [INFO] [timer.py:199:stop] epoch=2/micro_step=150/global_step=1990, RunningAvgSamplesPerSec=177.0936704671639, CurrSamplesPerSec=176.99900764410754, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:52,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=36, lr=[9.232298195507963e-06, 9.232298195507963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:52,690] [INFO] [timer.py:199:stop] epoch=2/micro_step=160/global_step=2000, RunningAvgSamplesPerSec=177.0933051317691, CurrSamplesPerSec=176.89671360883577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:56,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=36, lr=[9.228097040645329e-06, 9.228097040645329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:56,313] [INFO] [timer.py:199:stop] epoch=2/micro_step=170/global_step=2010, RunningAvgSamplesPerSec=177.0921583718301, CurrSamplesPerSec=176.98500375483036, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 21:59:59,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=36, lr=[9.223875829879829e-06, 9.223875829879829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 21:59:59,937] [INFO] [timer.py:199:stop] epoch=2/micro_step=180/global_step=2020, RunningAvgSamplesPerSec=177.09064558295927, CurrSamplesPerSec=176.9420723085979, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:03,544] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=2030, skipped=36, lr=[9.219634582438881e-06, 9.219634582438881e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:03,562] [INFO] [timer.py:199:stop] epoch=2/micro_step=190/global_step=2030, RunningAvgSamplesPerSec=177.08888453209406, CurrSamplesPerSec=176.88389145876988, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:06,792] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:00:07,126] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:00:07,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=38, lr=[9.216227171058895e-06, 9.216227171058895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:07,127] [INFO] [timer.py:199:stop] epoch=2/micro_step=200/global_step=2040, RunningAvgSamplesPerSec=177.10194086388543, CurrSamplesPerSec=191.96599418314025, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:10,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=38, lr=[9.211949906346505e-06, 9.211949906346505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:10,754] [INFO] [timer.py:199:stop] epoch=2/micro_step=210/global_step=2050, RunningAvgSamplesPerSec=177.0996839090297, CurrSamplesPerSec=173.87033602934937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:14,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=38, lr=[9.2076526592807e-06, 9.2076526592807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:14,375] [INFO] [timer.py:199:stop] epoch=2/micro_step=220/global_step=2060, RunningAvgSamplesPerSec=177.09929418906515, CurrSamplesPerSec=176.99270561969925, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:17,977] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=38, lr=[9.203335449435236e-06, 
9.203335449435236e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:17,996] [INFO] [timer.py:199:stop] epoch=2/micro_step=230/global_step=2070, RunningAvgSamplesPerSec=177.09873631331777, CurrSamplesPerSec=177.13741513836538, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:21,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=38, lr=[9.198998296474807e-06, 9.198998296474807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:21,622] [INFO] [timer.py:199:stop] epoch=2/micro_step=240/global_step=2080, RunningAvgSamplesPerSec=177.09676460297894, CurrSamplesPerSec=176.8604666944704, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:25,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=38, lr=[9.194641220154943e-06, 9.194641220154943e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:25,243] [INFO] [timer.py:199:stop] epoch=2/micro_step=250/global_step=2090, RunningAvgSamplesPerSec=177.0962228583155, CurrSamplesPerSec=176.78173040132188, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:28,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=38, lr=[9.190264240321921e-06, 9.190264240321921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:28,863] [INFO] [timer.py:199:stop] epoch=2/micro_step=260/global_step=2100, RunningAvgSamplesPerSec=177.0955804162965, CurrSamplesPerSec=177.0039095289245, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:32,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=38, lr=[9.185867376912686e-06, 9.185867376912686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:32,492] [INFO] [timer.py:199:stop] epoch=2/micro_step=270/global_step=2110, RunningAvgSamplesPerSec=177.09306021518913, CurrSamplesPerSec=176.77823781171654, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:36,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=38, lr=[9.181450649954749e-06, 9.181450649954749e-06], mom=[(0.9, 
0.95), (0.9, 0.95)] [2023-04-21 22:00:36,119] [INFO] [timer.py:199:stop] epoch=2/micro_step=280/global_step=2120, RunningAvgSamplesPerSec=177.09112563492138, CurrSamplesPerSec=176.82272043106371, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:39,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=38, lr=[9.17701407956609e-06, 9.17701407956609e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:39,743] [INFO] [timer.py:199:stop] epoch=2/micro_step=290/global_step=2130, RunningAvgSamplesPerSec=177.0896769182843, CurrSamplesPerSec=177.06730835193483, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:43,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=38, lr=[9.172557685955084e-06, 9.172557685955084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:43,366] [INFO] [timer.py:199:stop] epoch=2/micro_step=300/global_step=2140, RunningAvgSamplesPerSec=177.08850974512646, CurrSamplesPerSec=176.89111826021173, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:43,699] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:00:44,032] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:00:46,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=40, lr=[9.16897831198386e-06, 9.16897831198386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:46,935] [INFO] [timer.py:199:stop] epoch=2/micro_step=310/global_step=2150, RunningAvgSamplesPerSec=177.09978524148522, CurrSamplesPerSec=176.83448528462375, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:50,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=40, lr=[9.164486287785888e-06, 9.164486287785888e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:50,553] [INFO] [timer.py:199:stop] epoch=2/micro_step=320/global_step=2160, RunningAvgSamplesPerSec=177.0996287629393, CurrSamplesPerSec=177.03239251681882, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:54,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=40, lr=[9.15997449742908e-06, 9.15997449742908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:54,202] [INFO] [timer.py:199:stop] epoch=2/micro_step=330/global_step=2170, RunningAvgSamplesPerSec=177.09275245540317, CurrSamplesPerSec=176.76880850906906, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:00:57,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=40, lr=[9.15544296146443e-06, 9.15544296146443e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:00:57,824] [INFO] [timer.py:199:stop] epoch=2/micro_step=340/global_step=2180, RunningAvgSamplesPerSec=177.09198134896852, CurrSamplesPerSec=177.15039097945817, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:01,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=40, lr=[9.15089170053288e-06, 9.15089170053288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:01,451] [INFO] [timer.py:199:stop] epoch=2/micro_step=350/global_step=2190, RunningAvgSamplesPerSec=177.0899134893599, CurrSamplesPerSec=177.1385840551459, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:01:05,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=40, lr=[9.146320735365205e-06, 9.146320735365205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:05,077] [INFO] [timer.py:199:stop] epoch=2/micro_step=360/global_step=2200, RunningAvgSamplesPerSec=177.08828050572288, CurrSamplesPerSec=177.0450026678556, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:08,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=40, lr=[9.141730086781944e-06, 9.141730086781944e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:08,698] [INFO] [timer.py:199:stop] epoch=2/micro_step=370/global_step=2210, RunningAvgSamplesPerSec=177.08753889752137, CurrSamplesPerSec=176.9624854803132, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:12,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=40, lr=[9.137119775693286e-06, 9.137119775693286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:12,317] [INFO] [timer.py:199:stop] epoch=2/micro_step=380/global_step=2220, RunningAvgSamplesPerSec=177.08746626936676, CurrSamplesPerSec=176.93029313556318, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:15,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=40, lr=[9.132489823098989e-06, 9.132489823098989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:15,940] [INFO] [timer.py:199:stop] epoch=2/micro_step=390/global_step=2230, RunningAvgSamplesPerSec=177.08636510750011, CurrSamplesPerSec=175.91585464873575, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:19,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=40, lr=[9.127840250088267e-06, 9.127840250088267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:19,561] [INFO] [timer.py:199:stop] epoch=2/micro_step=400/global_step=2240, RunningAvgSamplesPerSec=177.0860073077655, CurrSamplesPerSec=177.06590678223242, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:01:20,618] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:01:20,951] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:01:23,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=42, lr=[9.124106479208876e-06, 9.124106479208876e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:23,125] [INFO] [timer.py:199:stop] epoch=2/micro_step=410/global_step=2250, RunningAvgSamplesPerSec=177.09772573894148, CurrSamplesPerSec=177.00951201547244, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:26,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=42, lr=[9.119421642878632e-06, 9.119421642878632e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:26,764] [INFO] [timer.py:199:stop] epoch=2/micro_step=420/global_step=2260, RunningAvgSamplesPerSec=177.0932736810206, CurrSamplesPerSec=176.90207615010704, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:30,367] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=42, lr=[9.114717245656921e-06, 9.114717245656921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:30,386] [INFO] [timer.py:199:stop] epoch=2/micro_step=430/global_step=2270, RunningAvgSamplesPerSec=177.09254586170934, CurrSamplesPerSec=176.90685608050177, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:33,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=42, lr=[9.109993308972054e-06, 9.109993308972054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:34,007] [INFO] [timer.py:199:stop] epoch=2/micro_step=440/global_step=2280, RunningAvgSamplesPerSec=177.09179238420347, CurrSamplesPerSec=176.9199147680276, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:37,620] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=42, 
lr=[9.105249854341344e-06, 9.105249854341344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:37,639] [INFO] [timer.py:199:stop] epoch=2/micro_step=450/global_step=2290, RunningAvgSamplesPerSec=177.089674547972, CurrSamplesPerSec=176.89275020032895, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:41,240] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=42, lr=[9.100486903371005e-06, 9.100486903371005e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:41,258] [INFO] [timer.py:199:stop] epoch=2/micro_step=460/global_step=2300, RunningAvgSamplesPerSec=177.08943806082172, CurrSamplesPerSec=177.0535272134374, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:44,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=42, lr=[9.095704477756058e-06, 9.095704477756058e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:44,877] [INFO] [timer.py:199:stop] epoch=2/micro_step=470/global_step=2310, RunningAvgSamplesPerSec=177.08921141881217, CurrSamplesPerSec=177.0022755371514, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:48,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=42, lr=[9.090902599280228e-06, 9.090902599280228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:48,509] [INFO] [timer.py:199:stop] epoch=2/micro_step=480/global_step=2320, RunningAvgSamplesPerSec=177.08634455679953, CurrSamplesPerSec=176.77858706446693, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:52,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=42, lr=[9.086081289815856e-06, 9.086081289815856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:52,130] [INFO] [timer.py:199:stop] epoch=2/micro_step=490/global_step=2330, RunningAvgSamplesPerSec=177.0856780601212, CurrSamplesPerSec=176.9037082924193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:55,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=42, lr=[9.081240571323775e-06, 
9.081240571323775e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:55,752] [INFO] [timer.py:199:stop] epoch=2/micro_step=500/global_step=2340, RunningAvgSamplesPerSec=177.0847981668279, CurrSamplesPerSec=176.7855724092875, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:01:57,533] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:01:57,876] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:01:59,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=44, lr=[9.077354036844291e-06, 9.077354036844291e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:01:59,327] [INFO] [timer.py:199:stop] epoch=2/micro_step=510/global_step=2350, RunningAvgSamplesPerSec=177.09388706098542, CurrSamplesPerSec=177.1562365739711, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:02,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=44, lr=[9.072478437725792e-06, 9.072478437725792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:02,947] [INFO] [timer.py:199:stop] epoch=2/micro_step=520/global_step=2360, RunningAvgSamplesPerSec=177.09343081389574, CurrSamplesPerSec=177.06859314365397, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:06,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=44, lr=[9.067583491539948e-06, 9.067583491539948e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:06,567] [INFO] [timer.py:199:stop] epoch=2/micro_step=530/global_step=2370, RunningAvgSamplesPerSec=177.0930914558829, CurrSamplesPerSec=176.99679021397694, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:10,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=44, lr=[9.062669220583011e-06, 9.062669220583011e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:10,189] [INFO] 
[timer.py:199:stop] epoch=2/micro_step=540/global_step=2380, RunningAvgSamplesPerSec=177.09224861735976, CurrSamplesPerSec=175.51408634884768, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:13,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=44, lr=[9.05773564723926e-06, 9.05773564723926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:13,810] [INFO] [timer.py:199:stop] epoch=2/micro_step=550/global_step=2390, RunningAvgSamplesPerSec=177.0916736103195, CurrSamplesPerSec=177.2747514584932, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:17,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=44, lr=[9.05278279398089e-06, 9.05278279398089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:17,440] [INFO] [timer.py:199:stop] epoch=2/micro_step=560/global_step=2400, RunningAvgSamplesPerSec=177.08934697819225, CurrSamplesPerSec=175.01470607411193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:21,042] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=44, lr=[9.04781068336792e-06, 9.04781068336792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:21,061] [INFO] [timer.py:199:stop] epoch=2/micro_step=570/global_step=2410, RunningAvgSamplesPerSec=177.0889486274888, CurrSamplesPerSec=176.90697266739687, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:24,662] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=44, lr=[9.04281933804808e-06, 9.04281933804808e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:24,680] [INFO] [timer.py:199:stop] epoch=2/micro_step=580/global_step=2420, RunningAvgSamplesPerSec=177.08873557740785, CurrSamplesPerSec=176.740643003404, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:28,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=44, lr=[9.037808780756722e-06, 9.037808780756722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:28,300] [INFO] [timer.py:199:stop] 
epoch=2/micro_step=590/global_step=2430, RunningAvgSamplesPerSec=177.08849512429964, CurrSamplesPerSec=176.9238793939246, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:31,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=44, lr=[9.032779034316696e-06, 9.032779034316696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:31,955] [INFO] [timer.py:199:stop] epoch=2/micro_step=600/global_step=2440, RunningAvgSamplesPerSec=177.08229871692353, CurrSamplesPerSec=177.02737230850397, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:34,459] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:02:34,795] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:02:35,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=46, lr=[9.028741436370401e-06, 9.028741436370401e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:35,520] [INFO] [timer.py:199:stop] epoch=2/micro_step=610/global_step=2450, RunningAvgSamplesPerSec=177.0929670670982, CurrSamplesPerSec=177.08634853827402, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:39,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=46, lr=[9.023677207255308e-06, 9.023677207255308e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:39,161] [INFO] [timer.py:199:stop] epoch=2/micro_step=620/global_step=2460, RunningAvgSamplesPerSec=177.08829843839666, CurrSamplesPerSec=176.99142192900442, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:42,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=46, lr=[9.018593853360213e-06, 9.018593853360213e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:42,809] [INFO] [timer.py:199:stop] epoch=2/micro_step=630/global_step=2470, RunningAvgSamplesPerSec=177.0824219728806, 
CurrSamplesPerSec=176.60773458126667, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:46,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=46, lr=[9.013491397839557e-06, 9.013491397839557e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:46,428] [INFO] [timer.py:199:stop] epoch=2/micro_step=640/global_step=2480, RunningAvgSamplesPerSec=177.08227692206776, CurrSamplesPerSec=177.12584369455237, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:50,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=46, lr=[9.008369863934787e-06, 9.008369863934787e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:50,049] [INFO] [timer.py:199:stop] epoch=2/micro_step=650/global_step=2490, RunningAvgSamplesPerSec=177.08156505425103, CurrSamplesPerSec=177.11439062288656, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:53,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=46, lr=[9.003229274974254e-06, 9.003229274974254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:53,667] [INFO] [timer.py:199:stop] epoch=2/micro_step=660/global_step=2500, RunningAvgSamplesPerSec=177.08162834850629, CurrSamplesPerSec=177.13460980103363, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:02:57,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=46, lr=[8.998069654373099e-06, 8.998069654373099e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:02:57,286] [INFO] [timer.py:199:stop] epoch=2/micro_step=670/global_step=2510, RunningAvgSamplesPerSec=177.08156742724944, CurrSamplesPerSec=176.91000398060837, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:00,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=46, lr=[8.99289102563316e-06, 8.99289102563316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:00,913] [INFO] [timer.py:199:stop] epoch=2/micro_step=680/global_step=2520, RunningAvgSamplesPerSec=177.07991526641928, 
CurrSamplesPerSec=176.83646566143187, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:04,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=46, lr=[8.987693412342847e-06, 8.987693412342847e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:04,562] [INFO] [timer.py:199:stop] epoch=2/micro_step=690/global_step=2530, RunningAvgSamplesPerSec=177.07388825502244, CurrSamplesPerSec=164.19938659688503, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:08,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=46, lr=[8.982476838177047e-06, 8.982476838177047e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:08,196] [INFO] [timer.py:199:stop] epoch=2/micro_step=700/global_step=2540, RunningAvgSamplesPerSec=177.07241711775075, CurrSamplesPerSec=177.0437182184526, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:11,432] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:03:11,765] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:03:11,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=48, lr=[8.978289942978722e-06, 8.978289942978722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:11,766] [INFO] [timer.py:199:stop] epoch=2/micro_step=710/global_step=2550, RunningAvgSamplesPerSec=177.08169946936985, CurrSamplesPerSec=192.10804721913098, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:15,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=48, lr=[8.973039299173377e-06, 8.973039299173377e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:15,386] [INFO] [timer.py:199:stop] epoch=2/micro_step=720/global_step=2560, RunningAvgSamplesPerSec=177.08157593744653, CurrSamplesPerSec=177.06322050232083, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:18,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=48, lr=[8.967769761241352e-06, 8.967769761241352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:19,011] [INFO] [timer.py:199:stop] epoch=2/micro_step=730/global_step=2570, RunningAvgSamplesPerSec=177.0802691018004, CurrSamplesPerSec=175.62501292151958, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:22,612] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=48, lr=[8.962481353185147e-06, 8.962481353185147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:22,631] [INFO] [timer.py:199:stop] epoch=2/micro_step=740/global_step=2580, RunningAvgSamplesPerSec=177.08012339578138, CurrSamplesPerSec=177.0260881149014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:26,232] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=48, lr=[8.957174099093217e-06, 8.957174099093217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:26,250] [INFO] [timer.py:199:stop] epoch=2/micro_step=750/global_step=2590, RunningAvgSamplesPerSec=177.07986601818013, CurrSamplesPerSec=176.99492294748615, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:03:29,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=48, lr=[8.95184802313986e-06, 8.95184802313986e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:29,875] [INFO] [timer.py:199:stop] epoch=2/micro_step=760/global_step=2600, RunningAvgSamplesPerSec=177.07866469000658, CurrSamplesPerSec=177.12946691336583, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:33,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=48, lr=[8.946503149585103e-06, 8.946503149585103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:33,497] [INFO] [timer.py:199:stop] epoch=2/micro_step=770/global_step=2610, RunningAvgSamplesPerSec=177.07787207174985, CurrSamplesPerSec=176.80816213617172, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:37,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=48, lr=[8.941139502774598e-06, 8.941139502774598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:37,134] [INFO] [timer.py:199:stop] epoch=2/micro_step=780/global_step=2620, RunningAvgSamplesPerSec=177.07440915849443, CurrSamplesPerSec=177.11088487663588, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:40,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=48, lr=[8.935757107139506e-06, 8.935757107139506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:40,754] [INFO] [timer.py:199:stop] epoch=2/micro_step=790/global_step=2630, RunningAvgSamplesPerSec=177.07407487335266, CurrSamplesPerSec=177.05084130913255, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:44,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=48, lr=[8.93035598719639e-06, 8.93035598719639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:44,378] [INFO] [timer.py:199:stop] epoch=2/micro_step=800/global_step=2640, RunningAvgSamplesPerSec=177.0730625204342, CurrSamplesPerSec=176.8713042469862, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:03:47,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=48, lr=[8.924936167547103e-06, 8.924936167547103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:48,016] [INFO] [timer.py:199:stop] epoch=2/micro_step=810/global_step=2650, RunningAvgSamplesPerSec=177.0698942745113, CurrSamplesPerSec=177.03145850301553, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:48,349] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:03:48,683] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:03:51,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=50, lr=[8.920586864626051e-06, 8.920586864626051e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:51,588] [INFO] [timer.py:199:stop] epoch=2/micro_step=820/global_step=2660, RunningAvgSamplesPerSec=177.07836775945228, CurrSamplesPerSec=176.88342523379265, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:55,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=50, lr=[8.915133447774127e-06, 8.915133447774127e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:55,214] [INFO] [timer.py:199:stop] epoch=2/micro_step=830/global_step=2670, RunningAvgSamplesPerSec=177.07706619643375, CurrSamplesPerSec=176.714347783726, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:03:58,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=50, lr=[8.909661400553994e-06, 8.909661400553994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:03:58,833] [INFO] [timer.py:199:stop] epoch=2/micro_step=840/global_step=2680, RunningAvgSamplesPerSec=177.07688568449768, CurrSamplesPerSec=176.97986955062677, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:02,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=50, 
lr=[8.90417074789057e-06, 8.90417074789057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:02,457] [INFO] [timer.py:199:stop] epoch=2/micro_step=850/global_step=2690, RunningAvgSamplesPerSec=177.07587954462292, CurrSamplesPerSec=176.90009430350176, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:06,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=50, lr=[8.898661514793523e-06, 8.898661514793523e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:06,077] [INFO] [timer.py:199:stop] epoch=2/micro_step=860/global_step=2700, RunningAvgSamplesPerSec=177.0755469618806, CurrSamplesPerSec=176.8660601041158, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:09,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=50, lr=[8.893133726357158e-06, 8.893133726357158e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:09,740] [INFO] [timer.py:199:stop] epoch=2/micro_step=870/global_step=2710, RunningAvgSamplesPerSec=177.06781634879542, CurrSamplesPerSec=173.06201687325205, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:13,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=50, lr=[8.887587407760289e-06, 8.887587407760289e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:13,368] [INFO] [timer.py:199:stop] epoch=2/micro_step=880/global_step=2720, RunningAvgSamplesPerSec=177.06627835761626, CurrSamplesPerSec=176.68957228209464, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:16,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=50, lr=[8.882022584266147e-06, 8.882022584266147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:16,987] [INFO] [timer.py:199:stop] epoch=2/micro_step=890/global_step=2730, RunningAvgSamplesPerSec=177.06615262047478, CurrSamplesPerSec=177.01861681256904, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:20,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=50, lr=[8.876439281222242e-06, 
8.876439281222242e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:04:20,610] [INFO] [timer.py:199:stop] epoch=2/micro_step=900/global_step=2740, RunningAvgSamplesPerSec=177.06528466575668, CurrSamplesPerSec=176.8462517038309, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:04:24,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=50, lr=[8.870837524060258e-06, 8.870837524060258e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:04:24,236] [INFO] [timer.py:199:stop] epoch=2/micro_step=910/global_step=2750, RunningAvgSamplesPerSec=177.0641358106171, CurrSamplesPerSec=176.9265614474952, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:04:25,293] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-21 22:04:25,626] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-21 22:04:27,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=52, lr=[8.866342848509415e-06, 8.866342848509415e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:04:27,800] [INFO] [timer.py:199:stop] epoch=2/micro_step=920/global_step=2760, RunningAvgSamplesPerSec=177.07379208629789, CurrSamplesPerSec=176.77393047459245, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 3/16 *****
ppl: 1.903592824935913
Beginning of Epoch 4/16, Total Micro Batches 920
[2023-04-21 22:04:39,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=52, lr=[8.860707938290982e-06, 8.860707938290982e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:04:39,568] [INFO] [timer.py:199:stop] epoch=3/micro_step=10/global_step=2770, RunningAvgSamplesPerSec=177.06947062559237, CurrSamplesPerSec=176.69736476603856, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:04:43,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780,
skipped=52, lr=[8.85505464561001e-06, 8.85505464561001e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:43,191] [INFO] [timer.py:199:stop] epoch=3/micro_step=20/global_step=2780, RunningAvgSamplesPerSec=177.06869022485142, CurrSamplesPerSec=176.99503965048757, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:46,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=52, lr=[8.849382996216985e-06, 8.849382996216985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:46,810] [INFO] [timer.py:199:stop] epoch=3/micro_step=30/global_step=2790, RunningAvgSamplesPerSec=177.06860546045104, CurrSamplesPerSec=177.01044579771116, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:50,419] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=52, lr=[8.843693015946007e-06, 8.843693015946007e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:50,437] [INFO] [timer.py:199:stop] epoch=3/micro_step=40/global_step=2800, RunningAvgSamplesPerSec=177.0672294984399, CurrSamplesPerSec=176.2821823530957, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:54,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=52, lr=[8.837984730714672e-06, 8.837984730714672e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:54,059] [INFO] [timer.py:199:stop] epoch=3/micro_step=50/global_step=2810, RunningAvgSamplesPerSec=177.06670813997084, CurrSamplesPerSec=176.8400770509817, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:04:57,661] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=52, lr=[8.832258166523955e-06, 8.832258166523955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:04:57,679] [INFO] [timer.py:199:stop] epoch=3/micro_step=60/global_step=2820, RunningAvgSamplesPerSec=177.06649043533685, CurrSamplesPerSec=176.94802080380745, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:01,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=52, lr=[8.826513349458089e-06, 
8.826513349458089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:01,298] [INFO] [timer.py:199:stop] epoch=3/micro_step=70/global_step=2830, RunningAvgSamplesPerSec=177.06644463446014, CurrSamplesPerSec=177.00122512979914, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:04,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=52, lr=[8.820750305684452e-06, 8.820750305684452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:04,917] [INFO] [timer.py:199:stop] epoch=3/micro_step=80/global_step=2840, RunningAvgSamplesPerSec=177.06641767601812, CurrSamplesPerSec=177.0134806580286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:08,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=52, lr=[8.81496906145344e-06, 8.81496906145344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:08,536] [INFO] [timer.py:199:stop] epoch=3/micro_step=90/global_step=2850, RunningAvgSamplesPerSec=177.0662591820771, CurrSamplesPerSec=177.02410348870796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:10,317] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:05:10,650] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:05:12,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=54, lr=[8.810330979432513e-06, 8.810330979432513e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:12,099] [INFO] [timer.py:199:stop] epoch=3/micro_step=100/global_step=2860, RunningAvgSamplesPerSec=177.0757896378999, CurrSamplesPerSec=176.73901386862468, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:15,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=54, lr=[8.804517040793774e-06, 8.804517040793774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:15,717] [INFO] [timer.py:199:stop] epoch=3/micro_step=110/global_step=2870, RunningAvgSamplesPerSec=177.07577175346304, CurrSamplesPerSec=177.12747996856467, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:19,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=54, lr=[8.798684975639427e-06, 8.798684975639427e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:19,351] [INFO] [timer.py:199:stop] epoch=3/micro_step=120/global_step=2880, RunningAvgSamplesPerSec=177.07321575591638, CurrSamplesPerSec=176.78731883175087, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:22,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=54, lr=[8.792834810534262e-06, 8.792834810534262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:22,971] [INFO] [timer.py:199:stop] epoch=3/micro_step=130/global_step=2890, RunningAvgSamplesPerSec=177.07302878689353, CurrSamplesPerSec=176.9502370121245, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:26,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=54, lr=[8.786966572125507e-06, 8.786966572125507e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:26,592] [INFO] [timer.py:199:stop] epoch=3/micro_step=140/global_step=2900, RunningAvgSamplesPerSec=177.0725162708821, CurrSamplesPerSec=176.06388884516295, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:05:30,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=54, lr=[8.781080287142716e-06, 8.781080287142716e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:30,212] [INFO] [timer.py:199:stop] epoch=3/micro_step=150/global_step=2910, RunningAvgSamplesPerSec=177.07217638440636, CurrSamplesPerSec=176.87060500996907, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:33,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=54, lr=[8.775175982397645e-06, 8.775175982397645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:33,832] [INFO] [timer.py:199:stop] epoch=3/micro_step=160/global_step=2920, RunningAvgSamplesPerSec=177.07205986875684, CurrSamplesPerSec=176.97240214896814, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:37,433] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=54, lr=[8.769253684784129e-06, 8.769253684784129e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:37,451] [INFO] [timer.py:199:stop] epoch=3/micro_step=170/global_step=2930, RunningAvgSamplesPerSec=177.07190784718858, CurrSamplesPerSec=176.97881940919035, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:41,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=54, lr=[8.763313421277957e-06, 8.763313421277957e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:41,069] [INFO] [timer.py:199:stop] epoch=3/micro_step=180/global_step=2940, RunningAvgSamplesPerSec=177.07191628400318, CurrSamplesPerSec=177.1117028716809, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:44,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=54, lr=[8.757355218936757e-06, 8.757355218936757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:44,693] [INFO] [timer.py:199:stop] epoch=3/micro_step=190/global_step=2950, RunningAvgSamplesPerSec=177.07105200204404, CurrSamplesPerSec=176.89822907667235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:05:47,200] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:05:47,533] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:05:48,240] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=56, lr=[8.752575759337464e-06, 8.752575759337464e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:48,258] [INFO] [timer.py:199:stop] epoch=3/micro_step=200/global_step=2960, RunningAvgSamplesPerSec=177.07989093703873, CurrSamplesPerSec=176.7943048666677, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:51,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=56, lr=[8.746585335539165e-06, 8.746585335539165e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:51,884] [INFO] [timer.py:199:stop] epoch=3/micro_step=210/global_step=2970, RunningAvgSamplesPerSec=177.07860424119437, CurrSamplesPerSec=176.89846222787497, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:55,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=56, lr=[8.740577049101491e-06, 8.740577049101491e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:55,527] [INFO] [timer.py:199:stop] epoch=3/micro_step=220/global_step=2980, RunningAvgSamplesPerSec=177.07465542515462, CurrSamplesPerSec=177.00157526419798, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:05:59,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=56, lr=[8.73455092739191e-06, 8.73455092739191e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:05:59,148] [INFO] [timer.py:199:stop] epoch=3/micro_step=230/global_step=2990, RunningAvgSamplesPerSec=177.07429206153915, CurrSamplesPerSec=176.99760715545577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:02,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=56, 
lr=[8.728506997859123e-06, 8.728506997859123e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:02,766] [INFO] [timer.py:199:stop] epoch=3/micro_step=240/global_step=3000, RunningAvgSamplesPerSec=177.07417954350367, CurrSamplesPerSec=176.6505039204783, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:06,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=56, lr=[8.72244528803295e-06, 8.72244528803295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:06,386] [INFO] [timer.py:199:stop] epoch=3/micro_step=250/global_step=3010, RunningAvgSamplesPerSec=177.07400603052702, CurrSamplesPerSec=177.0270220720373, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:09,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=56, lr=[8.7163658255242e-06, 8.7163658255242e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:10,008] [INFO] [timer.py:199:stop] epoch=3/micro_step=260/global_step=3020, RunningAvgSamplesPerSec=177.07346737905948, CurrSamplesPerSec=176.94067272078902, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:13,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=56, lr=[8.710268638024543e-06, 8.710268638024543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:13,628] [INFO] [timer.py:199:stop] epoch=3/micro_step=270/global_step=3030, RunningAvgSamplesPerSec=177.0731158298107, CurrSamplesPerSec=176.08594798016605, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:17,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=56, lr=[8.704153753306384e-06, 8.704153753306384e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:17,248] [INFO] [timer.py:199:stop] epoch=3/micro_step=280/global_step=3040, RunningAvgSamplesPerSec=177.07290208781518, CurrSamplesPerSec=177.12245436598008, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:20,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=56, lr=[8.698021199222738e-06, 
8.698021199222738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:20,868] [INFO] [timer.py:199:stop] epoch=3/micro_step=290/global_step=3050, RunningAvgSamplesPerSec=177.07269066856546, CurrSamplesPerSec=177.11661100062946, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:24,118] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:06:24,451] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:06:24,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=58, lr=[8.693102452781284e-06, 8.693102452781284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:24,452] [INFO] [timer.py:199:stop] epoch=3/micro_step=300/global_step=3060, RunningAvgSamplesPerSec=177.0781420472198, CurrSamplesPerSec=192.2556806055111, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:28,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=58, lr=[8.68693816428619e-06, 8.68693816428619e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:28,073] [INFO] [timer.py:199:stop] epoch=3/micro_step=310/global_step=3070, RunningAvgSamplesPerSec=177.0777802446931, CurrSamplesPerSec=177.09814850001055, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:31,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=58, lr=[8.680756284841818e-06, 8.680756284841818e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:31,693] [INFO] [timer.py:199:stop] epoch=3/micro_step=320/global_step=3080, RunningAvgSamplesPerSec=177.07742174328737, CurrSamplesPerSec=177.02713881737222, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:35,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=58, lr=[8.674556842606344e-06, 8.674556842606344e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:35,314] [INFO] 
[timer.py:199:stop] epoch=3/micro_step=330/global_step=3090, RunningAvgSamplesPerSec=177.0769877919454, CurrSamplesPerSec=176.93169255917266, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:38,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=58, lr=[8.668339865817942e-06, 8.668339865817942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:38,936] [INFO] [timer.py:199:stop] epoch=3/micro_step=340/global_step=3100, RunningAvgSamplesPerSec=177.07644055007665, CurrSamplesPerSec=176.7235385860166, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:42,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=58, lr=[8.662105382794651e-06, 8.662105382794651e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:42,558] [INFO] [timer.py:199:stop] epoch=3/micro_step=350/global_step=3110, RunningAvgSamplesPerSec=177.07591964629157, CurrSamplesPerSec=176.8711877070994, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:46,161] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=58, lr=[8.655853421934254e-06, 8.655853421934254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:46,180] [INFO] [timer.py:199:stop] epoch=3/micro_step=360/global_step=3120, RunningAvgSamplesPerSec=177.07537975897156, CurrSamplesPerSec=176.06308050017873, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:49,784] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=58, lr=[8.649584011714141e-06, 8.649584011714141e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:49,802] [INFO] [timer.py:199:stop] epoch=3/micro_step=370/global_step=3130, RunningAvgSamplesPerSec=177.07473731067506, CurrSamplesPerSec=176.79267474246936, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:53,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=58, lr=[8.643297180691187e-06, 8.643297180691187e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:53,421] [INFO] [timer.py:199:stop] 
epoch=3/micro_step=380/global_step=3140, RunningAvgSamplesPerSec=177.07473719814692, CurrSamplesPerSec=177.0608846729984, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:06:57,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=58, lr=[8.636992957501612e-06, 8.636992957501612e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:06:57,076] [INFO] [timer.py:199:stop] epoch=3/micro_step=390/global_step=3150, RunningAvgSamplesPerSec=177.0699890419398, CurrSamplesPerSec=177.09639593101542, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:00,697] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=58, lr=[8.630671370860863e-06, 8.630671370860863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:00,715] [INFO] [timer.py:199:stop] epoch=3/micro_step=400/global_step=3160, RunningAvgSamplesPerSec=177.0668418992633, CurrSamplesPerSec=177.24186026549754, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:01,048] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:07:01,381] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:07:04,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=60, lr=[8.625601619210692e-06, 8.625601619210692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:04,279] [INFO] [timer.py:199:stop] epoch=3/micro_step=410/global_step=3170, RunningAvgSamplesPerSec=177.07529432339052, CurrSamplesPerSec=176.98897129847515, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:07,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=60, lr=[8.61924885097312e-06, 8.61924885097312e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:07,901] [INFO] [timer.py:199:stop] epoch=3/micro_step=420/global_step=3180, RunningAvgSamplesPerSec=177.0746705956821, CurrSamplesPerSec=176.98850451940189, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:11,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=60, lr=[8.612878800107956e-06, 8.612878800107956e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:11,522] [INFO] [timer.py:199:stop] epoch=3/micro_step=430/global_step=3190, RunningAvgSamplesPerSec=177.0742305763829, CurrSamplesPerSec=176.9708854100053, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:15,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=60, lr=[8.606491495630485e-06, 8.606491495630485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:15,142] [INFO] [timer.py:199:stop] epoch=3/micro_step=440/global_step=3200, RunningAvgSamplesPerSec=177.07402418592682, CurrSamplesPerSec=176.76310485132197, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:18,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=60, lr=[8.600086966634588e-06, 8.600086966634588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:18,761] [INFO] [timer.py:199:stop] epoch=3/micro_step=450/global_step=3210, RunningAvgSamplesPerSec=177.0738835304409, CurrSamplesPerSec=177.15331372849232, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:07:22,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=60, lr=[8.593665242292592e-06, 8.593665242292592e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:22,379] [INFO] [timer.py:199:stop] epoch=3/micro_step=460/global_step=3220, RunningAvgSamplesPerSec=177.0738733332284, CurrSamplesPerSec=176.9364740902059, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:25,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=60, lr=[8.587226351855153e-06, 8.587226351855153e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:25,999] [INFO] [timer.py:199:stop] epoch=3/micro_step=470/global_step=3230, RunningAvgSamplesPerSec=177.07370915732315, CurrSamplesPerSec=177.23378564675184, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:29,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=60, lr=[8.580770324651124e-06, 8.580770324651124e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:29,623] [INFO] [timer.py:199:stop] epoch=3/micro_step=480/global_step=3240, RunningAvgSamplesPerSec=177.07291734170198, CurrSamplesPerSec=177.17330791808598, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:33,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=60, lr=[8.574297190087406e-06, 8.574297190087406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:33,242] [INFO] [timer.py:199:stop] epoch=3/micro_step=490/global_step=3250, RunningAvgSamplesPerSec=177.07271756326142, CurrSamplesPerSec=176.83728112359847, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:36,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=60, lr=[8.567806977648827e-06, 8.567806977648827e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:36,861] [INFO] [timer.py:199:stop] epoch=3/micro_step=500/global_step=3260, RunningAvgSamplesPerSec=177.07269421973888, CurrSamplesPerSec=176.83005872037796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:07:37,919] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:07:38,252] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:07:40,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=62, lr=[8.562602531491531e-06, 8.562602531491531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:40,425] [INFO] [timer.py:199:stop] epoch=3/micro_step=510/global_step=3270, RunningAvgSamplesPerSec=177.08081539445763, CurrSamplesPerSec=176.99772386199695, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:44,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=62, lr=[8.556081653428184e-06, 8.556081653428184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:44,045] [INFO] [timer.py:199:stop] epoch=3/micro_step=520/global_step=3280, RunningAvgSamplesPerSec=177.08063360064943, CurrSamplesPerSec=176.6376012044514, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:47,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=62, lr=[8.549543780460902e-06, 8.549543780460902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:47,665] [INFO] [timer.py:199:stop] epoch=3/micro_step=530/global_step=3290, RunningAvgSamplesPerSec=177.0804410822223, CurrSamplesPerSec=177.06847634363635, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:51,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=62, lr=[8.542988942369392e-06, 8.542988942369392e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:51,283] [INFO] [timer.py:199:stop] epoch=3/micro_step=540/global_step=3300, RunningAvgSamplesPerSec=177.0804371041967, CurrSamplesPerSec=177.02176869133302, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:54,886] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=62, 
lr=[8.536417169010639e-06, 8.536417169010639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:54,904] [INFO] [timer.py:199:stop] epoch=3/micro_step=550/global_step=3310, RunningAvgSamplesPerSec=177.08007256895257, CurrSamplesPerSec=177.08915234714388, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:07:58,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=62, lr=[8.529828490318763e-06, 8.529828490318763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:07:58,524] [INFO] [timer.py:199:stop] epoch=3/micro_step=560/global_step=3320, RunningAvgSamplesPerSec=177.07977054106055, CurrSamplesPerSec=176.80792922293827, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:02,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=62, lr=[8.523222936304894e-06, 8.523222936304894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:02,147] [INFO] [timer.py:199:stop] epoch=3/micro_step=570/global_step=3330, RunningAvgSamplesPerSec=177.07917208086337, CurrSamplesPerSec=177.05189230545992, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:05,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=62, lr=[8.516600537057021e-06, 8.516600537057021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:05,773] [INFO] [timer.py:199:stop] epoch=3/micro_step=580/global_step=3340, RunningAvgSamplesPerSec=177.07801365293872, CurrSamplesPerSec=176.8694396272264, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:09,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=62, lr=[8.509961322739866e-06, 8.509961322739866e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:09,392] [INFO] [timer.py:199:stop] epoch=3/micro_step=590/global_step=3350, RunningAvgSamplesPerSec=177.0778512213355, CurrSamplesPerSec=177.21541570473104, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:12,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=62, lr=[8.503305323594745e-06, 
8.503305323594745e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:13,011] [INFO] [timer.py:199:stop] epoch=3/micro_step=600/global_step=3360, RunningAvgSamplesPerSec=177.077772026885, CurrSamplesPerSec=177.09511073579765, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:14,791] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:08:15,124] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:08:16,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=64, lr=[8.497968459573483e-06, 8.497968459573483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:16,573] [INFO] [timer.py:199:stop] epoch=3/micro_step=610/global_step=3370, RunningAvgSamplesPerSec=177.0859478428321, CurrSamplesPerSec=177.24607340214027, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:20,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=64, lr=[8.491282324190084e-06, 8.491282324190084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:20,191] [INFO] [timer.py:199:stop] epoch=3/micro_step=620/global_step=3380, RunningAvgSamplesPerSec=177.08590337884, CurrSamplesPerSec=177.06800914510666, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:23,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=64, lr=[8.484579489060685e-06, 8.484579489060685e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:23,811] [INFO] [timer.py:199:stop] epoch=3/micro_step=630/global_step=3390, RunningAvgSamplesPerSec=177.0857147005359, CurrSamplesPerSec=177.0028591021786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:27,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=64, lr=[8.477859984716394e-06, 8.477859984716394e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:27,432] [INFO] 
[timer.py:199:stop] epoch=3/micro_step=640/global_step=3400, RunningAvgSamplesPerSec=177.08524119712519, CurrSamplesPerSec=177.06135183393258, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:31,035] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=64, lr=[8.471123841764245e-06, 8.471123841764245e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:31,054] [INFO] [timer.py:199:stop] epoch=3/micro_step=650/global_step=3410, RunningAvgSamplesPerSec=177.08476694430544, CurrSamplesPerSec=176.53456338785847, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:34,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=64, lr=[8.464371090887049e-06, 8.464371090887049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:34,678] [INFO] [timer.py:199:stop] epoch=3/micro_step=660/global_step=3420, RunningAvgSamplesPerSec=177.0838576825894, CurrSamplesPerSec=176.9088380786178, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:38,279] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=64, lr=[8.45760176284328e-06, 8.45760176284328e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:38,297] [INFO] [timer.py:199:stop] epoch=3/micro_step=670/global_step=3430, RunningAvgSamplesPerSec=177.0837912033907, CurrSamplesPerSec=177.03320978698147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:41,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=64, lr=[8.450815888466909e-06, 8.450815888466909e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:41,918] [INFO] [timer.py:199:stop] epoch=3/micro_step=680/global_step=3440, RunningAvgSamplesPerSec=177.08338097157412, CurrSamplesPerSec=176.90160982926372, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:45,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=64, lr=[8.444013498667281e-06, 8.444013498667281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:45,540] [INFO] [timer.py:199:stop] 
epoch=3/micro_step=690/global_step=3450, RunningAvgSamplesPerSec=177.0829286364331, CurrSamplesPerSec=177.01266357001364, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:49,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=64, lr=[8.437194624428967e-06, 8.437194624428967e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:49,160] [INFO] [timer.py:199:stop] epoch=3/micro_step=700/global_step=3460, RunningAvgSamplesPerSec=177.0825844212314, CurrSamplesPerSec=177.12689558151843, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:51,664] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:08:51,997] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:08:52,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=66, lr=[8.43172767711203e-06, 8.43172767711203e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:52,724] [INFO] [timer.py:199:stop] epoch=3/micro_step=710/global_step=3470, RunningAvgSamplesPerSec=177.0902292661916, CurrSamplesPerSec=177.04897290202584, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:56,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=66, lr=[8.42487920920478e-06, 8.42487920920478e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:56,346] [INFO] [timer.py:199:stop] epoch=3/micro_step=720/global_step=3480, RunningAvgSamplesPerSec=177.08976318967137, CurrSamplesPerSec=177.08961965725413, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:08:59,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=66, lr=[8.418014344014644e-06, 8.418014344014644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:08:59,966] [INFO] [timer.py:199:stop] epoch=3/micro_step=730/global_step=3490, RunningAvgSamplesPerSec=177.08942636232732, 
CurrSamplesPerSec=177.09464339670689, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:03,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=66, lr=[8.411133112810762e-06, 8.411133112810762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:03,586] [INFO] [timer.py:199:stop] epoch=3/micro_step=740/global_step=3500, RunningAvgSamplesPerSec=177.08907098894622, CurrSamplesPerSec=176.91874873540235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:07,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=66, lr=[8.404235546936829e-06, 8.404235546936829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:07,221] [INFO] [timer.py:199:stop] epoch=3/micro_step=750/global_step=3510, RunningAvgSamplesPerSec=177.08720377182703, CurrSamplesPerSec=176.744134107639, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:10,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=66, lr=[8.397321677810934e-06, 8.397321677810934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:10,842] [INFO] [timer.py:199:stop] epoch=3/micro_step=760/global_step=3520, RunningAvgSamplesPerSec=177.08671624422115, CurrSamplesPerSec=176.97298551703042, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:14,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=66, lr=[8.390391536925431e-06, 8.390391536925431e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:14,463] [INFO] [timer.py:199:stop] epoch=3/micro_step=770/global_step=3530, RunningAvgSamplesPerSec=177.08637903557573, CurrSamplesPerSec=176.6007632829456, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:18,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=66, lr=[8.38344515584679e-06, 8.38344515584679e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:18,083] [INFO] [timer.py:199:stop] epoch=3/micro_step=780/global_step=3540, RunningAvgSamplesPerSec=177.08616679830973, CurrSamplesPerSec=177.30226770295093, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:21,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=66, lr=[8.376482566215455e-06, 8.376482566215455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:21,701] [INFO] [timer.py:199:stop] epoch=3/micro_step=790/global_step=3550, RunningAvgSamplesPerSec=177.08619444199516, CurrSamplesPerSec=177.065322801405, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:25,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=66, lr=[8.3695037997457e-06, 8.3695037997457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:25,319] [INFO] [timer.py:199:stop] epoch=3/micro_step=800/global_step=3560, RunningAvgSamplesPerSec=177.08617934461947, CurrSamplesPerSec=177.09464339670689, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:28,548] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:09:28,882] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:09:28,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=68, lr=[8.363909160605268e-06, 8.363909160605268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:28,883] [INFO] [timer.py:199:stop] epoch=3/micro_step=810/global_step=3570, RunningAvgSamplesPerSec=177.0937375085558, CurrSamplesPerSec=191.9997711184369, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:32,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=68, lr=[8.356901355981433e-06, 8.356901355981433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:32,501] [INFO] [timer.py:199:stop] epoch=3/micro_step=820/global_step=3580, RunningAvgSamplesPerSec=177.0937265545864, CurrSamplesPerSec=177.04476913021426, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:36,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=68, lr=[8.349877463710679e-06, 8.349877463710679e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:36,122] [INFO] [timer.py:199:stop] epoch=3/micro_step=830/global_step=3590, RunningAvgSamplesPerSec=177.09325884749114, CurrSamplesPerSec=176.7939555518088, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:39,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=68, lr=[8.342837515786516e-06, 8.342837515786516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:39,745] [INFO] [timer.py:199:stop] epoch=3/micro_step=840/global_step=3600, RunningAvgSamplesPerSec=177.09267892393902, CurrSamplesPerSec=177.0227026028924, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:43,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=68, lr=[8.335781544275574e-06, 8.335781544275574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:43,374] [INFO] [timer.py:199:stop] epoch=3/micro_step=850/global_step=3610, RunningAvgSamplesPerSec=177.0911292036249, CurrSamplesPerSec=176.94277211080532, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:09:46,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=68, lr=[8.32870958131748e-06, 8.32870958131748e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:46,993] [INFO] [timer.py:199:stop] epoch=3/micro_step=860/global_step=3620, RunningAvgSamplesPerSec=177.09094946379489, CurrSamplesPerSec=177.01826661074577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:50,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=68, lr=[8.321621659124696e-06, 8.321621659124696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:50,612] [INFO] [timer.py:199:stop] epoch=3/micro_step=870/global_step=3630, RunningAvgSamplesPerSec=177.0908286146368, CurrSamplesPerSec=177.11427376244222, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:54,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=68, lr=[8.31451780998238e-06, 8.31451780998238e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:54,234] [INFO] [timer.py:199:stop] epoch=3/micro_step=880/global_step=3640, RunningAvgSamplesPerSec=177.09036009449272, CurrSamplesPerSec=176.82621477858473, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:09:57,833] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=68, lr=[8.307398066248235e-06, 8.307398066248235e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:09:57,851] [INFO] [timer.py:199:stop] epoch=3/micro_step=890/global_step=3650, RunningAvgSamplesPerSec=177.09037631928038, CurrSamplesPerSec=177.00507668437584, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:01,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=68, lr=[8.300262460352361e-06, 8.300262460352361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:01,470] [INFO] [timer.py:199:stop] epoch=3/micro_step=900/global_step=3660, RunningAvgSamplesPerSec=177.09024432748424, CurrSamplesPerSec=177.28669363041004, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:10:05,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=68, lr=[8.293111024797115e-06, 8.293111024797115e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:10:05,090] [INFO] [timer.py:199:stop] epoch=3/micro_step=910/global_step=3670, RunningAvgSamplesPerSec=177.09003932116175, CurrSamplesPerSec=176.988854603476, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:10:05,423] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-21 22:10:05,756] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-21 22:10:08,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=70, lr=[8.287378500885789e-06, 8.287378500885789e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:10:08,652] [INFO] [timer.py:199:stop] epoch=3/micro_step=920/global_step=3680, RunningAvgSamplesPerSec=177.09751151007285, CurrSamplesPerSec=176.91606691869262, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 4/16 *****
ppl: 1.872370719909668
Beginning of Epoch 5/16, Total Micro Batches 920
[2023-04-21 22:10:20,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=3690, skipped=70, lr=[8.280198654079664e-06, 8.280198654079664e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:10:20,455] [INFO] [timer.py:199:stop] epoch=4/micro_step=10/global_step=3690, RunningAvgSamplesPerSec=177.0945905234252, CurrSamplesPerSec=176.89787935102078, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:10:24,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=3700, skipped=70, lr=[8.273003069003873e-06, 8.273003069003873e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:10:24,077] [INFO] [timer.py:199:stop] epoch=4/micro_step=20/global_step=3700, RunningAvgSamplesPerSec=177.09420379943757, CurrSamplesPerSec=176.87095462778655, MemAllocated=4.98GB,
MaxMemAllocated=24.08GB [2023-04-21 22:10:27,681] [INFO] [logging.py:96:log_dist] [Rank 0] step=3710, skipped=70, lr=[8.265791778433975e-06, 8.265791778433975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:27,700] [INFO] [timer.py:199:stop] epoch=4/micro_step=30/global_step=3710, RunningAvgSamplesPerSec=177.0935531374707, CurrSamplesPerSec=176.84019354987467, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:31,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=3720, skipped=70, lr=[8.258564815217059e-06, 8.258564815217059e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:31,330] [INFO] [timer.py:199:stop] epoch=4/micro_step=40/global_step=3720, RunningAvgSamplesPerSec=177.09209497769749, CurrSamplesPerSec=176.83565020678606, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:34,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=3730, skipped=70, lr=[8.251322212271614e-06, 8.251322212271614e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:34,950] [INFO] [timer.py:199:stop] epoch=4/micro_step=50/global_step=3730, RunningAvgSamplesPerSec=177.0918767383875, CurrSamplesPerSec=177.27393195667005, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:38,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=3740, skipped=70, lr=[8.244064002587355e-06, 8.244064002587355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:38,569] [INFO] [timer.py:199:stop] epoch=4/micro_step=60/global_step=3740, RunningAvgSamplesPerSec=177.09168995321912, CurrSamplesPerSec=177.1744773086237, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:42,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=3750, skipped=70, lr=[8.236790219225093e-06, 8.236790219225093e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:42,189] [INFO] [timer.py:199:stop] epoch=4/micro_step=70/global_step=3750, RunningAvgSamplesPerSec=177.09154821032791, CurrSamplesPerSec=177.0747837650995, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:10:45,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=3760, skipped=70, lr=[8.229500895316573e-06, 8.229500895316573e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:45,833] [INFO] [timer.py:199:stop] epoch=4/micro_step=80/global_step=3760, RunningAvgSamplesPerSec=177.08820778639426, CurrSamplesPerSec=177.01067924481023, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:49,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=3770, skipped=70, lr=[8.222196064064329e-06, 8.222196064064329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:49,457] [INFO] [timer.py:199:stop] epoch=4/micro_step=90/global_step=3770, RunningAvgSamplesPerSec=177.0874484472268, CurrSamplesPerSec=176.79919541963517, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:50,515] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:10:50,848] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:10:53,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=3780, skipped=72, lr=[8.216341056132252e-06, 8.216341056132252e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:53,022] [INFO] [timer.py:199:stop] epoch=4/micro_step=100/global_step=3780, RunningAvgSamplesPerSec=177.09436111420536, CurrSamplesPerSec=176.96388541321937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:10:56,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=3790, skipped=72, lr=[8.209008395557055e-06, 8.209008395557055e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:10:56,642] [INFO] [timer.py:199:stop] epoch=4/micro_step=110/global_step=3790, RunningAvgSamplesPerSec=177.09416264331395, CurrSamplesPerSec=176.84764979830567, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:00,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=3800, skipped=72, lr=[8.20166032098052e-06, 8.20166032098052e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:00,262] [INFO] [timer.py:199:stop] epoch=4/micro_step=120/global_step=3800, RunningAvgSamplesPerSec=177.09398582845068, CurrSamplesPerSec=177.0437182184526, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:03,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=3810, skipped=72, lr=[8.194296865872786e-06, 8.194296865872786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:03,882] [INFO] [timer.py:199:stop] epoch=4/micro_step=130/global_step=3810, RunningAvgSamplesPerSec=177.09376079158974, CurrSamplesPerSec=176.93857338058982, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:07,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=3820, skipped=72, lr=[8.186918063774048e-06, 8.186918063774048e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:07,502] [INFO] [timer.py:199:stop] epoch=4/micro_step=140/global_step=3820, RunningAvgSamplesPerSec=177.0934913698689, CurrSamplesPerSec=177.000641575546, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:11:11,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=3830, skipped=72, lr=[8.179523948294408e-06, 8.179523948294408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:11,123] [INFO] [timer.py:199:stop] epoch=4/micro_step=150/global_step=3830, RunningAvgSamplesPerSec=177.0931166874086, CurrSamplesPerSec=177.20582272648355, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:14,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=3840, skipped=72, lr=[8.172114553113722e-06, 8.172114553113722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:14,743] [INFO] [timer.py:199:stop] epoch=4/micro_step=160/global_step=3840, RunningAvgSamplesPerSec=177.0928887053396, CurrSamplesPerSec=177.00671072786702, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:18,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=3850, skipped=72, lr=[8.164689911981435e-06, 8.164689911981435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:18,407] [INFO] [timer.py:199:stop] epoch=4/micro_step=170/global_step=3850, RunningAvgSamplesPerSec=177.08707926512002, CurrSamplesPerSec=172.6178671905394, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:22,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=3860, skipped=72, lr=[8.15725005871645e-06, 8.15725005871645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:22,029] [INFO] [timer.py:199:stop] epoch=4/micro_step=180/global_step=3860, RunningAvgSamplesPerSec=177.08667154469182, CurrSamplesPerSec=176.95128681438786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:25,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=3870, skipped=72, lr=[8.14979502720695e-06, 8.14979502720695e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:25,649] [INFO] [timer.py:199:stop] epoch=4/micro_step=190/global_step=3870, RunningAvgSamplesPerSec=177.08642193053643, CurrSamplesPerSec=176.96388541321937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:11:27,429] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:11:27,763] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:11:29,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=3880, skipped=74, lr=[8.143820096480303e-06, 8.143820096480303e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:29,214] [INFO] [timer.py:199:stop] epoch=4/micro_step=200/global_step=3880, RunningAvgSamplesPerSec=177.09315013951303, CurrSamplesPerSec=176.9019595696657, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:32,816] [INFO] [logging.py:96:log_dist] [Rank 0] step=3890, skipped=74, lr=[8.13633782974949e-06, 8.13633782974949e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:32,834] [INFO] [timer.py:199:stop] epoch=4/micro_step=210/global_step=3890, RunningAvgSamplesPerSec=177.09296049157226, CurrSamplesPerSec=177.24127512347147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:36,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=3900, skipped=74, lr=[8.1288404800284e-06, 8.1288404800284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:36,454] [INFO] [timer.py:199:stop] epoch=4/micro_step=220/global_step=3900, RunningAvgSamplesPerSec=177.09279936162787, CurrSamplesPerSec=176.95093687891602, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:40,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=3910, skipped=74, lr=[8.121328081467107e-06, 8.121328081467107e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:40,073] [INFO] [timer.py:199:stop] epoch=4/micro_step=230/global_step=3910, RunningAvgSamplesPerSec=177.0926539445768, CurrSamplesPerSec=177.15097552154964, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:43,674] [INFO] [logging.py:96:log_dist] [Rank 0] step=3920, skipped=74, 
lr=[8.11380066828424e-06, 8.11380066828424e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:43,692] [INFO] [timer.py:199:stop] epoch=4/micro_step=240/global_step=3920, RunningAvgSamplesPerSec=177.09254776666828, CurrSamplesPerSec=177.0744333410073, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:47,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=3930, skipped=74, lr=[8.106258274766821e-06, 8.106258274766821e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:47,313] [INFO] [timer.py:199:stop] epoch=4/micro_step=250/global_step=3930, RunningAvgSamplesPerSec=177.09224127344035, CurrSamplesPerSec=176.7330794173319, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:50,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=3940, skipped=74, lr=[8.098700935270097e-06, 8.098700935270097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:50,954] [INFO] [timer.py:199:stop] epoch=4/micro_step=260/global_step=3940, RunningAvgSamplesPerSec=177.08942189862435, CurrSamplesPerSec=166.77391389166223, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:54,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=3950, skipped=74, lr=[8.091128684217402e-06, 8.091128684217402e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:54,574] [INFO] [timer.py:199:stop] epoch=4/micro_step=270/global_step=3950, RunningAvgSamplesPerSec=177.08926177655724, CurrSamplesPerSec=177.0555124994311, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:11:58,174] [INFO] [logging.py:96:log_dist] [Rank 0] step=3960, skipped=74, lr=[8.083541556099988e-06, 8.083541556099988e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:11:58,193] [INFO] [timer.py:199:stop] epoch=4/micro_step=280/global_step=3960, RunningAvgSamplesPerSec=177.0891690536149, CurrSamplesPerSec=176.99072174192833, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:01,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=3970, skipped=74, lr=[8.075939585476871e-06, 
8.075939585476871e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:01,813] [INFO] [timer.py:199:stop] epoch=4/micro_step=290/global_step=3970, RunningAvgSamplesPerSec=177.08891147965878, CurrSamplesPerSec=176.99889093587342, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:04,318] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:12:04,651] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:12:05,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=3980, skipped=76, lr=[8.069847345641095e-06, 8.069847345641095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:05,377] [INFO] [timer.py:199:stop] epoch=4/micro_step=300/global_step=3980, RunningAvgSamplesPerSec=177.0956878935803, CurrSamplesPerSec=176.75670322492283, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:08,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=3990, skipped=76, lr=[8.062218745812137e-06, 8.062218745812137e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:08,998] [INFO] [timer.py:199:stop] epoch=4/micro_step=310/global_step=3990, RunningAvgSamplesPerSec=177.0952462915483, CurrSamplesPerSec=176.91361837444185, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:12,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=76, lr=[8.054575400601889e-06, 8.054575400601889e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:12,620] [INFO] [timer.py:199:stop] epoch=4/micro_step=320/global_step=4000, RunningAvgSamplesPerSec=177.0948400108012, CurrSamplesPerSec=177.07688633877908, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:16,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=4010, skipped=76, lr=[8.046917344825433e-06, 8.046917344825433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:16,239] 
[INFO] [timer.py:199:stop] epoch=4/micro_step=330/global_step=4010, RunningAvgSamplesPerSec=177.0947378729297, CurrSamplesPerSec=177.2643326423755, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:19,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=4020, skipped=76, lr=[8.03924461336486e-06, 8.03924461336486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:19,858] [INFO] [timer.py:199:stop] epoch=4/micro_step=340/global_step=4020, RunningAvgSamplesPerSec=177.0946666007673, CurrSamplesPerSec=176.97076873885425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:23,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=4030, skipped=76, lr=[8.031557241169105e-06, 8.031557241169105e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:23,527] [INFO] [timer.py:199:stop] epoch=4/micro_step=350/global_step=4030, RunningAvgSamplesPerSec=177.08840396581328, CurrSamplesPerSec=176.76042775114576, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:27,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=4040, skipped=76, lr=[8.023855263253791e-06, 8.023855263253791e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:27,153] [INFO] [timer.py:199:stop] epoch=4/micro_step=360/global_step=4040, RunningAvgSamplesPerSec=177.08757487548414, CurrSamplesPerSec=177.00892840657616, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:30,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=4050, skipped=76, lr=[8.016138714701073e-06, 8.016138714701073e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:30,771] [INFO] [timer.py:199:stop] epoch=4/micro_step=370/global_step=4050, RunningAvgSamplesPerSec=177.08752668014205, CurrSamplesPerSec=177.03706273367422, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:34,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=4060, skipped=76, lr=[8.008407630659467e-06, 8.008407630659467e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:34,389] [INFO] [timer.py:199:stop] 
epoch=4/micro_step=380/global_step=4060, RunningAvgSamplesPerSec=177.0875509247358, CurrSamplesPerSec=177.12222062396324, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:37,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=4070, skipped=76, lr=[8.000662046343707e-06, 8.000662046343707e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:38,009] [INFO] [timer.py:199:stop] epoch=4/micro_step=390/global_step=4070, RunningAvgSamplesPerSec=177.08743950091875, CurrSamplesPerSec=176.99737374283515, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:41,236] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:12:41,570] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:12:41,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=4080, skipped=78, lr=[7.994455162400175e-06, 7.994455162400175e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:41,570] [INFO] [timer.py:199:stop] epoch=4/micro_step=400/global_step=4080, RunningAvgSamplesPerSec=177.09422611128173, CurrSamplesPerSec=192.08948871158182, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:45,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=4090, skipped=78, lr=[7.986683566542777e-06, 7.986683566542777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:45,190] [INFO] [timer.py:199:stop] epoch=4/micro_step=410/global_step=4090, RunningAvgSamplesPerSec=177.0941134705435, CurrSamplesPerSec=177.05574606541623, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:48,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=4100, skipped=78, lr=[7.978897569363325e-06, 7.978897569363325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:48,810] [INFO] [timer.py:199:stop] epoch=4/micro_step=420/global_step=4100, RunningAvgSamplesPerSec=177.09389592158294, 
CurrSamplesPerSec=176.9128022080863, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:52,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=4110, skipped=78, lr=[7.971097206326683e-06, 7.971097206326683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:52,433] [INFO] [timer.py:199:stop] epoch=4/micro_step=430/global_step=4110, RunningAvgSamplesPerSec=177.09337375533266, CurrSamplesPerSec=177.06485561951652, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:56,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=4120, skipped=78, lr=[7.963282512963134e-06, 7.963282512963134e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:56,062] [INFO] [timer.py:199:stop] epoch=4/micro_step=440/global_step=4120, RunningAvgSamplesPerSec=177.09211543345398, CurrSamplesPerSec=176.85522319422648, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:12:59,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=78, lr=[7.95545352486825e-06, 7.95545352486825e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:12:59,683] [INFO] [timer.py:199:stop] epoch=4/micro_step=450/global_step=4130, RunningAvgSamplesPerSec=177.09169929073738, CurrSamplesPerSec=176.95361975290413, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:03,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=4140, skipped=78, lr=[7.947610277702705e-06, 7.947610277702705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:03,307] [INFO] [timer.py:199:stop] epoch=4/micro_step=460/global_step=4140, RunningAvgSamplesPerSec=177.09103348863252, CurrSamplesPerSec=177.0029758156458, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:06,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=4150, skipped=78, lr=[7.939752807192133e-06, 7.939752807192133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:06,926] [INFO] [timer.py:199:stop] epoch=4/micro_step=470/global_step=4150, RunningAvgSamplesPerSec=177.09095459714737, CurrSamplesPerSec=177.09639593101542, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:10,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=4160, skipped=78, lr=[7.931881149126938e-06, 7.931881149126938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:10,551] [INFO] [timer.py:199:stop] epoch=4/micro_step=480/global_step=4160, RunningAvgSamplesPerSec=177.09022363506188, CurrSamplesPerSec=176.933791736095, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:14,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=4170, skipped=78, lr=[7.923995339362163e-06, 7.923995339362163e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:14,173] [INFO] [timer.py:199:stop] epoch=4/micro_step=490/global_step=4170, RunningAvgSamplesPerSec=177.08970226085128, CurrSamplesPerSec=176.77218431663593, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:17,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=4180, skipped=78, lr=[7.91609541381731e-06, 7.91609541381731e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:17,796] [INFO] [timer.py:199:stop] epoch=4/micro_step=500/global_step=4180, RunningAvgSamplesPerSec=177.08921327739722, CurrSamplesPerSec=176.87142078702658, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:18,129] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:13:18,462] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:13:21,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=4190, skipped=80, lr=[7.909765334198717e-06, 7.909765334198717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:21,360] [INFO] [timer.py:199:stop] epoch=4/micro_step=510/global_step=4190, RunningAvgSamplesPerSec=177.09549245925973, CurrSamplesPerSec=176.83902856785235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:24,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=4200, skipped=80, lr=[7.901840090971978e-06, 7.901840090971978e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:24,980] [INFO] [timer.py:199:stop] epoch=4/micro_step=520/global_step=4200, RunningAvgSamplesPerSec=177.0953541471895, CurrSamplesPerSec=176.97520235074305, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:28,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=4210, skipped=80, lr=[7.893900832881286e-06, 7.893900832881286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:28,621] [INFO] [timer.py:199:stop] epoch=4/micro_step=530/global_step=4210, RunningAvgSamplesPerSec=177.09306336905053, CurrSamplesPerSec=176.21645239895807, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:32,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=4220, skipped=80, lr=[7.88594759608959e-06, 7.88594759608959e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:32,255] [INFO] [timer.py:199:stop] epoch=4/micro_step=540/global_step=4220, RunningAvgSamplesPerSec=177.0913075835237, CurrSamplesPerSec=176.91909854357607, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:35,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=4230, skipped=80, lr=[7.87798041682352e-06, 7.87798041682352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:35,876] [INFO] [timer.py:199:stop] epoch=4/micro_step=550/global_step=4230, RunningAvgSamplesPerSec=177.0909107413599, CurrSamplesPerSec=177.102471651843, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:13:39,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=4240, skipped=80, lr=[7.869999331373206e-06, 7.869999331373206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:39,497] [INFO] [timer.py:199:stop] epoch=4/micro_step=560/global_step=4240, RunningAvgSamplesPerSec=177.09062496676427, CurrSamplesPerSec=176.83413581096792, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:43,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=4250, skipped=80, lr=[7.862004376092122e-06, 7.862004376092122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:43,131] [INFO] [timer.py:199:stop] epoch=4/micro_step=570/global_step=4250, RunningAvgSamplesPerSec=177.08949301135036, CurrSamplesPerSec=176.60192512777937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:46,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=4260, skipped=80, lr=[7.853995587396918e-06, 7.853995587396918e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:46,760] [INFO] [timer.py:199:stop] epoch=4/micro_step=580/global_step=4260, RunningAvgSamplesPerSec=177.0881785819866, CurrSamplesPerSec=176.79290761551437, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:50,367] [INFO] [logging.py:96:log_dist] [Rank 0] step=4270, skipped=80, lr=[7.845973001767257e-06, 7.845973001767257e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:50,385] [INFO] [timer.py:199:stop] epoch=4/micro_step=590/global_step=4270, RunningAvgSamplesPerSec=177.08744624273976, CurrSamplesPerSec=176.90627314833122, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:13:53,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=4280, skipped=80, lr=[7.837936655745642e-06, 7.837936655745642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:54,006] [INFO] [timer.py:199:stop] epoch=4/micro_step=600/global_step=4280, RunningAvgSamplesPerSec=177.0871108163881, CurrSamplesPerSec=176.8531258811841, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:13:55,063] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:13:55,397] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:13:57,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=4290, skipped=82, lr=[7.831497696042727e-06, 7.831497696042727e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:13:57,571] [INFO] [timer.py:199:stop] epoch=4/micro_step=610/global_step=4290, RunningAvgSamplesPerSec=177.0932559165725, CurrSamplesPerSec=176.8051343120001, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:01,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=4300, skipped=82, lr=[7.823436673602674e-06, 7.823436673602674e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:01,213] [INFO] [timer.py:199:stop] epoch=4/micro_step=620/global_step=4300, RunningAvgSamplesPerSec=177.09060016842867, CurrSamplesPerSec=177.0822598094836, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:04,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=4310, skipped=82, lr=[7.8153619934226e-06, 7.8153619934226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:04,831] [INFO] [timer.py:199:stop] epoch=4/micro_step=630/global_step=4310, RunningAvgSamplesPerSec=177.0905374710352, CurrSamplesPerSec=177.11228715848202, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:08,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=4320, skipped=82, lr=[7.807273692282295e-06, 7.807273692282295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:08,451] [INFO] [timer.py:199:stop] epoch=4/micro_step=640/global_step=4320, RunningAvgSamplesPerSec=177.0903987924998, CurrSamplesPerSec=176.95571945016493, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:12,055] [INFO] [logging.py:96:log_dist] [Rank 0] step=4330, skipped=82, 
lr=[7.799171807023597e-06, 7.799171807023597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:12,074] [INFO] [timer.py:199:stop] epoch=4/micro_step=650/global_step=4330, RunningAvgSamplesPerSec=177.08987868718773, CurrSamplesPerSec=176.64201813829163, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:15,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=4340, skipped=82, lr=[7.791056374550221e-06, 7.791056374550221e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:15,696] [INFO] [timer.py:199:stop] epoch=4/micro_step=660/global_step=4340, RunningAvgSamplesPerSec=177.08945044826584, CurrSamplesPerSec=176.90067719495255, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=4350, skipped=82, lr=[7.782927431827583e-06, 7.782927431827583e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:19,315] [INFO] [timer.py:199:stop] epoch=4/micro_step=670/global_step=4350, RunningAvgSamplesPerSec=177.0893074630886, CurrSamplesPerSec=177.00017473491394, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:22,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=4360, skipped=82, lr=[7.77478501588264e-06, 7.77478501588264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:22,934] [INFO] [timer.py:199:stop] epoch=4/micro_step=680/global_step=4360, RunningAvgSamplesPerSec=177.0893022281755, CurrSamplesPerSec=177.08307754013694, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:26,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=4370, skipped=82, lr=[7.766629163803721e-06, 7.766629163803721e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:26,554] [INFO] [timer.py:199:stop] epoch=4/micro_step=690/global_step=4370, RunningAvgSamplesPerSec=177.08912458463595, CurrSamplesPerSec=176.89741305230305, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:30,160] [INFO] [logging.py:96:log_dist] [Rank 0] step=4380, skipped=82, lr=[7.75845991274035e-06, 
7.75845991274035e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:30,179] [INFO] [timer.py:199:stop] epoch=4/micro_step=700/global_step=4380, RunningAvgSamplesPerSec=177.08832289912007, CurrSamplesPerSec=176.75088396691427, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:31,973] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:14:32,307] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:14:33,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=4390, skipped=84, lr=[7.7519148896243e-06, 7.7519148896243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:33,756] [INFO] [timer.py:199:stop] epoch=4/micro_step=710/global_step=4390, RunningAvgSamplesPerSec=177.09313369422506, CurrSamplesPerSec=176.67759413223533, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:37,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=4400, skipped=84, lr=[7.743721614200437e-06, 7.743721614200437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:37,409] [INFO] [timer.py:199:stop] epoch=4/micro_step=720/global_step=4400, RunningAvgSamplesPerSec=177.08925756608215, CurrSamplesPerSec=176.95128681438786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:41,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=4410, skipped=84, lr=[7.735515044134952e-06, 7.735515044134952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:41,030] [INFO] [timer.py:199:stop] epoch=4/micro_step=730/global_step=4410, RunningAvgSamplesPerSec=177.0889419894901, CurrSamplesPerSec=177.06205257995597, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:44,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=4420, skipped=84, lr=[7.727295216808389e-06, 7.727295216808389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:44,651] [INFO] 
[timer.py:199:stop] epoch=4/micro_step=740/global_step=4420, RunningAvgSamplesPerSec=177.08862950853825, CurrSamplesPerSec=177.23437073932757, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:48,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=4430, skipped=84, lr=[7.719062169661682e-06, 7.719062169661682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:48,277] [INFO] [timer.py:199:stop] epoch=4/micro_step=750/global_step=4430, RunningAvgSamplesPerSec=177.08782533390635, CurrSamplesPerSec=177.0248039399303, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:51,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=4440, skipped=84, lr=[7.710815940195977e-06, 7.710815940195977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:51,900] [INFO] [timer.py:199:stop] epoch=4/micro_step=760/global_step=4440, RunningAvgSamplesPerSec=177.08726278311792, CurrSamplesPerSec=176.91047034570752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:55,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=4450, skipped=84, lr=[7.702556565972468e-06, 7.702556565972468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:55,523] [INFO] [timer.py:199:stop] epoch=4/micro_step=770/global_step=4450, RunningAvgSamplesPerSec=177.08681977324974, CurrSamplesPerSec=176.98710419695456, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:14:59,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=4460, skipped=84, lr=[7.694284084612225e-06, 7.694284084612225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:14:59,143] [INFO] [timer.py:199:stop] epoch=4/micro_step=780/global_step=4460, RunningAvgSamplesPerSec=177.0866266562561, CurrSamplesPerSec=177.0381135664247, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:02,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=4470, skipped=84, lr=[7.685998533796011e-06, 7.685998533796011e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:02,762] [INFO] [timer.py:199:stop] 
epoch=4/micro_step=790/global_step=4470, RunningAvgSamplesPerSec=177.08657368791793, CurrSamplesPerSec=176.89286676863242, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:06,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=4480, skipped=84, lr=[7.677699951264129e-06, 7.677699951264129e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:06,382] [INFO] [timer.py:199:stop] epoch=4/micro_step=800/global_step=4480, RunningAvgSamplesPerSec=177.0864199420487, CurrSamplesPerSec=177.12584369455237, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:08,903] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:15:09,237] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:15:09,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=4490, skipped=86, lr=[7.671051727802724e-06, 7.671051727802724e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:09,961] [INFO] [timer.py:199:stop] epoch=4/micro_step=810/global_step=4490, RunningAvgSamplesPerSec=177.09102129000405, CurrSamplesPerSec=177.00449310472612, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:13,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=4500, skipped=86, lr=[7.66272978347756e-06, 7.66272978347756e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:13,580] [INFO] [timer.py:199:stop] epoch=4/micro_step=820/global_step=4500, RunningAvgSamplesPerSec=177.09095427858787, CurrSamplesPerSec=176.7855724092875, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:17,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=4510, skipped=86, lr=[7.654394913424805e-06, 7.654394913424805e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:17,202] [INFO] [timer.py:199:stop] epoch=4/micro_step=830/global_step=4510, RunningAvgSamplesPerSec=177.09059188891425, 
CurrSamplesPerSec=176.96890202135873, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:20,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=4520, skipped=86, lr=[7.646047155609408e-06, 7.646047155609408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:20,823] [INFO] [timer.py:199:stop] epoch=4/micro_step=840/global_step=4520, RunningAvgSamplesPerSec=177.09025823047455, CurrSamplesPerSec=176.76124251218036, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:24,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=4530, skipped=86, lr=[7.637686548055018e-06, 7.637686548055018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:24,442] [INFO] [timer.py:199:stop] epoch=4/micro_step=850/global_step=4530, RunningAvgSamplesPerSec=177.0901973996205, CurrSamplesPerSec=177.05609641554932, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:28,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=4540, skipped=86, lr=[7.6293131288438135e-06, 7.6293131288438135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:28,065] [INFO] [timer.py:199:stop] epoch=4/micro_step=860/global_step=4540, RunningAvgSamplesPerSec=177.08966743877062, CurrSamplesPerSec=176.86198154131594, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:31,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=4550, skipped=86, lr=[7.620926936116333e-06, 7.620926936116333e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:31,685] [INFO] [timer.py:199:stop] epoch=4/micro_step=870/global_step=4550, RunningAvgSamplesPerSec=177.08950586064458, CurrSamplesPerSec=177.18383298867136, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:35,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=4560, skipped=86, lr=[7.612528008071294e-06, 7.612528008071294e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:35,304] [INFO] [timer.py:199:stop] epoch=4/micro_step=880/global_step=4560, RunningAvgSamplesPerSec=177.08945715444597, 
CurrSamplesPerSec=177.16301794693598, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:38,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=4570, skipped=86, lr=[7.604116382965426e-06, 7.604116382965426e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:38,925] [INFO] [timer.py:199:stop] epoch=4/micro_step=890/global_step=4570, RunningAvgSamplesPerSec=177.08914375389875, CurrSamplesPerSec=176.9353078392949, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:42,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=4580, skipped=86, lr=[7.595692099113291e-06, 7.595692099113291e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:42,564] [INFO] [timer.py:199:stop] epoch=4/micro_step=900/global_step=4580, RunningAvgSamplesPerSec=177.0873014027794, CurrSamplesPerSec=176.7979145343729, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:45,793] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:15:46,126] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:15:46,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=4590, skipped=88, lr=[7.5889435835184686e-06, 7.5889435835184686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:46,127] [INFO] [timer.py:199:stop] epoch=4/micro_step=910/global_step=4590, RunningAvgSamplesPerSec=177.09323474791483, CurrSamplesPerSec=192.00183107214997, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:15:49,730] [INFO] [logging.py:96:log_dist] [Rank 0] step=4600, skipped=88, lr=[7.580496610659687e-06, 7.580496610659687e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:15:49,748] [INFO] [timer.py:199:stop] epoch=4/micro_step=920/global_step=4600, RunningAvgSamplesPerSec=177.09293772651264, CurrSamplesPerSec=176.98967147170154, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 5/16 *****
ppl: 1.8518953323364258
Beginning of Epoch 6/16, Total Micro Batches 920
[2023-04-21 22:16:01,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=4610, skipped=88, lr=[7.572037086641604e-06, 7.572037086641604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:01,518] [INFO] [timer.py:199:stop] epoch=5/micro_step=10/global_step=4610, RunningAvgSamplesPerSec=177.09017159950966, CurrSamplesPerSec=176.75460824789786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:05,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=4620, skipped=88, lr=[7.5635650499969625e-06, 7.5635650499969625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:05,144] [INFO] [timer.py:199:stop] epoch=5/micro_step=20/global_step=4620, RunningAvgSamplesPerSec=177.08935162151414, CurrSamplesPerSec=176.7629884539759, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:08,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=4630, skipped=88, lr=[7.555080539315493e-06, 7.555080539315493e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:08,764] [INFO] [timer.py:199:stop]
epoch=5/micro_step=30/global_step=4630, RunningAvgSamplesPerSec=177.08917789363545, CurrSamplesPerSec=177.13870094767245, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:12,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=4640, skipped=88, lr=[7.5465835932437515e-06, 7.5465835932437515e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:12,404] [INFO] [timer.py:199:stop] epoch=5/micro_step=40/global_step=4640, RunningAvgSamplesPerSec=177.0869750752545, CurrSamplesPerSec=168.2538919243158, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:16,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=4650, skipped=88, lr=[7.538074250484931e-06, 7.538074250484931e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:16,025] [INFO] [timer.py:199:stop] epoch=5/micro_step=50/global_step=4650, RunningAvgSamplesPerSec=177.08668164176407, CurrSamplesPerSec=177.0099789053603, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:19,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=4660, skipped=88, lr=[7.529552549798694e-06, 7.529552549798694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:19,646] [INFO] [timer.py:199:stop] epoch=5/micro_step=60/global_step=4660, RunningAvgSamplesPerSec=177.08640677449182, CurrSamplesPerSec=177.09709695445102, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:23,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=4670, skipped=88, lr=[7.521018530000993e-06, 7.521018530000993e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:23,267] [INFO] [timer.py:199:stop] epoch=5/micro_step=70/global_step=4670, RunningAvgSamplesPerSec=177.08615355733113, CurrSamplesPerSec=176.95513619592754, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:26,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=4680, skipped=88, lr=[7.51247222996389e-06, 7.51247222996389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:26,887] [INFO] [timer.py:199:stop] epoch=5/micro_step=80/global_step=4680, 
RunningAvgSamplesPerSec=177.08588998591875, CurrSamplesPerSec=176.92726112701752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:30,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=4690, skipped=88, lr=[7.503913688615389e-06, 7.503913688615389e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:30,507] [INFO] [timer.py:199:stop] epoch=5/micro_step=90/global_step=4690, RunningAvgSamplesPerSec=177.0858092279246, CurrSamplesPerSec=177.16757812757814, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:30,840] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:16:31,173] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:16:34,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=4700, skipped=90, lr=[7.497058067987595e-06, 7.497058067987595e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:34,070] [INFO] [timer.py:199:stop] epoch=5/micro_step=100/global_step=4700, RunningAvgSamplesPerSec=177.0915137407191, CurrSamplesPerSec=176.98266998871918, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:37,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=4710, skipped=90, lr=[7.488477590555002e-06, 7.488477590555002e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:37,690] [INFO] [timer.py:199:stop] epoch=5/micro_step=110/global_step=4710, RunningAvgSamplesPerSec=177.09140301192735, CurrSamplesPerSec=176.68701370591555, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:41,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=4720, skipped=90, lr=[7.479884981105479e-06, 7.479884981105479e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:41,318] [INFO] [timer.py:199:stop] epoch=5/micro_step=120/global_step=4720, RunningAvgSamplesPerSec=177.0903402635885, CurrSamplesPerSec=175.74482393794486, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:44,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=4730, skipped=90, lr=[7.471280278777963e-06, 7.471280278777963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:44,955] [INFO] [timer.py:199:stop] epoch=5/micro_step=130/global_step=4730, RunningAvgSamplesPerSec=177.0884411074709, CurrSamplesPerSec=177.03577839947556, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:48,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=4740, skipped=90, lr=[7.462663522766476e-06, 7.462663522766476e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:48,574] [INFO] [timer.py:199:stop] epoch=5/micro_step=140/global_step=4740, RunningAvgSamplesPerSec=177.08834215482463, CurrSamplesPerSec=177.02235238490286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:52,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=4750, skipped=90, lr=[7.45403475231994e-06, 7.45403475231994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:52,195] [INFO] [timer.py:199:stop] epoch=5/micro_step=150/global_step=4750, RunningAvgSamplesPerSec=177.08803100486276, CurrSamplesPerSec=176.78266178185768, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:55,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=4760, skipped=90, lr=[7.445394006742005e-06, 7.445394006742005e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:55,819] [INFO] [timer.py:199:stop] epoch=5/micro_step=160/global_step=4760, RunningAvgSamplesPerSec=177.0874657346831, CurrSamplesPerSec=177.0966296048773, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:16:59,419] [INFO] [logging.py:96:log_dist] [Rank 0] step=4770, skipped=90, lr=[7.436741325390867e-06, 7.436741325390867e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:16:59,438] [INFO] [timer.py:199:stop] epoch=5/micro_step=170/global_step=4770, RunningAvgSamplesPerSec=177.08744481912765, CurrSamplesPerSec=176.95758588956247, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:17:03,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=4780, skipped=90, lr=[7.428076747679087e-06, 7.428076747679087e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:03,056] [INFO] [timer.py:199:stop] epoch=5/micro_step=180/global_step=4780, RunningAvgSamplesPerSec=177.08745105793847, CurrSamplesPerSec=177.03612867059076, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:06,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=4790, skipped=90, lr=[7.419400313073417e-06, 7.419400313073417e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:06,679] [INFO] [timer.py:199:stop] epoch=5/micro_step=190/global_step=4790, RunningAvgSamplesPerSec=177.08698914241162, CurrSamplesPerSec=176.988854603476, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:07,735] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:17:08,069] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:17:10,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=4800, skipped=92, lr=[7.412450654981417e-06, 7.412450654981417e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:10,242] [INFO] [timer.py:199:stop] epoch=5/micro_step=200/global_step=4800, RunningAvgSamplesPerSec=177.09263480663665, CurrSamplesPerSec=177.05165874964217, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:13,844] [INFO] [logging.py:96:log_dist] [Rank 0] step=4810, skipped=92, lr=[7.403752977595229e-06, 7.403752977595229e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:13,863] [INFO] [timer.py:199:stop] epoch=5/micro_step=210/global_step=4810, RunningAvgSamplesPerSec=177.0923791747087, CurrSamplesPerSec=176.938690009294, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:17,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=4820, skipped=92, lr=[7.395043554108795e-06, 7.395043554108795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:17,500] [INFO] [timer.py:199:stop] epoch=5/micro_step=220/global_step=4820, RunningAvgSamplesPerSec=177.090510007201, CurrSamplesPerSec=177.03414381926436, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:21,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=4830, skipped=92, lr=[7.386322424193133e-06, 7.386322424193133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:21,122] [INFO] [timer.py:199:stop] epoch=5/micro_step=230/global_step=4830, RunningAvgSamplesPerSec=177.0900845984164, CurrSamplesPerSec=176.90953761796806, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:24,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=4840, skipped=92, lr=[7.377589627572588e-06, 7.377589627572588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:24,742] [INFO] [timer.py:199:stop] epoch=5/micro_step=240/global_step=4840, RunningAvgSamplesPerSec=177.089907863639, CurrSamplesPerSec=176.99060504462093, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:17:28,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=4850, skipped=92, lr=[7.368845204024645e-06, 7.368845204024645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:28,361] [INFO] [timer.py:199:stop] epoch=5/micro_step=250/global_step=4850, RunningAvgSamplesPerSec=177.08986136203816, CurrSamplesPerSec=177.01278029641108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:31,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=4860, skipped=92, lr=[7.360089193379744e-06, 7.360089193379744e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:31,979] [INFO] [timer.py:199:stop] epoch=5/micro_step=260/global_step=4860, RunningAvgSamplesPerSec=177.0898521348523, CurrSamplesPerSec=177.3268639156448, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:35,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=4870, skipped=92, lr=[7.351321635521108e-06, 7.351321635521108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:35,600] [INFO] [timer.py:199:stop] epoch=5/micro_step=270/global_step=4870, RunningAvgSamplesPerSec=177.0896606478026, CurrSamplesPerSec=176.96703534324368, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:39,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=4880, skipped=92, lr=[7.342542570384559e-06, 7.342542570384559e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:39,219] [INFO] [timer.py:199:stop] epoch=5/micro_step=280/global_step=4880, RunningAvgSamplesPerSec=177.08958840258032, CurrSamplesPerSec=177.09499390079372, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:42,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=4890, skipped=92, lr=[7.333752037958332e-06, 7.333752037958332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:42,846] [INFO] [timer.py:199:stop] epoch=5/micro_step=290/global_step=4890, RunningAvgSamplesPerSec=177.08865549824446, CurrSamplesPerSec=176.77963483099822, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:17:44,628] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:17:44,961] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:17:46,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=4900, skipped=94, lr=[7.326711382474223e-06, 7.326711382474223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:46,424] [INFO] [timer.py:199:stop] epoch=5/micro_step=300/global_step=4900, RunningAvgSamplesPerSec=177.09325584922968, CurrSamplesPerSec=177.0890355200017, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:50,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=4910, skipped=94, lr=[7.317900309863533e-06, 7.317900309863533e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:50,045] [INFO] [timer.py:199:stop] epoch=5/micro_step=310/global_step=4910, RunningAvgSamplesPerSec=177.09298444855096, CurrSamplesPerSec=177.10422434109304, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:53,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=4920, skipped=94, lr=[7.309077882207519e-06, 7.309077882207519e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:53,668] [INFO] [timer.py:199:stop] epoch=5/micro_step=320/global_step=4920, RunningAvgSamplesPerSec=177.09253992620285, CurrSamplesPerSec=176.9279608120738, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:17:57,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=4930, skipped=94, lr=[7.300244139691927e-06, 7.300244139691927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:17:57,291] [INFO] [timer.py:199:stop] epoch=5/micro_step=330/global_step=4930, RunningAvgSamplesPerSec=177.09202132211186, CurrSamplesPerSec=176.82831145339074, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:00,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=4940, skipped=94, 
lr=[7.291399122554046e-06, 7.291399122554046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:00,912] [INFO] [timer.py:199:stop] epoch=5/micro_step=340/global_step=4940, RunningAvgSamplesPerSec=177.09178123519095, CurrSamplesPerSec=177.15062479583185, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:04,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=4950, skipped=94, lr=[7.28254287108252e-06, 7.28254287108252e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:04,531] [INFO] [timer.py:199:stop] epoch=5/micro_step=350/global_step=4950, RunningAvgSamplesPerSec=177.09170060079654, CurrSamplesPerSec=177.12327246789735, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:08,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=4960, skipped=94, lr=[7.273675425617163e-06, 7.273675425617163e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:08,150] [INFO] [timer.py:199:stop] epoch=5/micro_step=360/global_step=4960, RunningAvgSamplesPerSec=177.09161385876422, CurrSamplesPerSec=176.89216736111615, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:11,751] [INFO] [logging.py:96:log_dist] [Rank 0] step=4970, skipped=94, lr=[7.264796826548777e-06, 7.264796826548777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:11,770] [INFO] [timer.py:199:stop] epoch=5/micro_step=370/global_step=4970, RunningAvgSamplesPerSec=177.09150811187888, CurrSamplesPerSec=177.13227208779955, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:15,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=4980, skipped=94, lr=[7.25590711431897e-06, 7.25590711431897e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:15,389] [INFO] [timer.py:199:stop] epoch=5/micro_step=380/global_step=4980, RunningAvgSamplesPerSec=177.0914217766438, CurrSamplesPerSec=177.0750173819315, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:18,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=4990, skipped=94, lr=[7.247006329419968e-06, 
7.247006329419968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:19,009] [INFO] [timer.py:199:stop] epoch=5/micro_step=390/global_step=4990, RunningAvgSamplesPerSec=177.09125528548464, CurrSamplesPerSec=176.88260934599637, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:21,528] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:18:21,863] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:18:22,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=5000, skipped=96, lr=[7.239877756421927e-06, 7.239877756421927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:22,587] [INFO] [timer.py:199:stop] epoch=5/micro_step=400/global_step=5000, RunningAvgSamplesPerSec=177.09513577188997, CurrSamplesPerSec=177.0176829441195, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:26,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=5010, skipped=96, lr=[7.23095714291966e-06, 7.23095714291966e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:26,207] [INFO] [timer.py:199:stop] epoch=5/micro_step=410/global_step=5010, RunningAvgSamplesPerSec=177.09503297797934, CurrSamplesPerSec=176.96050227994434, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:29,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=5020, skipped=96, lr=[7.2220255703941615e-06, 7.2220255703941615e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:29,828] [INFO] [timer.py:199:stop] epoch=5/micro_step=420/global_step=5020, RunningAvgSamplesPerSec=177.09471883389816, CurrSamplesPerSec=176.8021065915291, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:33,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=5030, skipped=96, lr=[7.2130830795283315e-06, 7.2130830795283315e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:33,448] 
[INFO] [timer.py:199:stop] epoch=5/micro_step=430/global_step=5030, RunningAvgSamplesPerSec=177.09454031846857, CurrSamplesPerSec=176.97613577102683, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:37,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=5040, skipped=96, lr=[7.2041297110548e-06, 7.2041297110548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:37,068] [INFO] [timer.py:199:stop] epoch=5/micro_step=440/global_step=5040, RunningAvgSamplesPerSec=177.09432294896823, CurrSamplesPerSec=176.8759659683932, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:40,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=5050, skipped=96, lr=[7.1951655057557455e-06, 7.1951655057557455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:40,688] [INFO] [timer.py:199:stop] epoch=5/micro_step=450/global_step=5050, RunningAvgSamplesPerSec=177.09421320714387, CurrSamplesPerSec=177.07396611104184, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:44,289] [INFO] [logging.py:96:log_dist] [Rank 0] step=5060, skipped=96, lr=[7.186190504462706e-06, 7.186190504462706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:44,308] [INFO] [timer.py:199:stop] epoch=5/micro_step=460/global_step=5060, RunningAvgSamplesPerSec=177.0940629683165, CurrSamplesPerSec=176.79849675264586, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:47,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=5070, skipped=96, lr=[7.1772047480564e-06, 7.1772047480564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:47,928] [INFO] [timer.py:199:stop] epoch=5/micro_step=470/global_step=5070, RunningAvgSamplesPerSec=177.0938224009511, CurrSamplesPerSec=176.92481227290014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:51,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=5080, skipped=96, lr=[7.168208277466528e-06, 7.168208277466528e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:51,549] [INFO] [timer.py:199:stop] 
epoch=5/micro_step=480/global_step=5080, RunningAvgSamplesPerSec=177.09357944560864, CurrSamplesPerSec=176.96050227994434, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:55,174] [INFO] [logging.py:96:log_dist] [Rank 0] step=5090, skipped=96, lr=[7.159201133671599e-06, 7.159201133671599e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:55,192] [INFO] [timer.py:199:stop] epoch=5/micro_step=490/global_step=5090, RunningAvgSamplesPerSec=177.09118616982292, CurrSamplesPerSec=176.8764321540493, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:18:58,423] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:18:58,756] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:18:58,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=5100, skipped=98, lr=[7.151987761496608e-06, 7.151987761496608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:18:58,757] [INFO] [timer.py:199:stop] epoch=5/micro_step=500/global_step=5100, RunningAvgSamplesPerSec=177.09636052246617, CurrSamplesPerSec=192.18369303354538, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:02,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=5110, skipped=98, lr=[7.142961509353471e-06, 7.142961509353471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:02,382] [INFO] [timer.py:199:stop] epoch=5/micro_step=510/global_step=5110, RunningAvgSamplesPerSec=177.0956568065347, CurrSamplesPerSec=176.72842522458464, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:05,985] [INFO] [logging.py:96:log_dist] [Rank 0] step=5120, skipped=98, lr=[7.133924699003135e-06, 7.133924699003135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:06,004] [INFO] [timer.py:199:stop] epoch=5/micro_step=520/global_step=5120, RunningAvgSamplesPerSec=177.09536059155494, 
CurrSamplesPerSec=176.9255119385878, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:09,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=5130, skipped=98, lr=[7.124877371607849e-06, 7.124877371607849e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:09,623] [INFO] [timer.py:199:stop] epoch=5/micro_step=530/global_step=5130, RunningAvgSamplesPerSec=177.09528163688742, CurrSamplesPerSec=176.9076721919946, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:13,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=5140, skipped=98, lr=[7.115819568377772e-06, 7.115819568377772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:13,250] [INFO] [timer.py:199:stop] epoch=5/micro_step=540/global_step=5140, RunningAvgSamplesPerSec=177.09446020918816, CurrSamplesPerSec=176.80816213617172, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:16,852] [INFO] [logging.py:96:log_dist] [Rank 0] step=5150, skipped=98, lr=[7.106751330570777e-06, 7.106751330570777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:16,870] [INFO] [timer.py:199:stop] epoch=5/micro_step=550/global_step=5150, RunningAvgSamplesPerSec=177.09421616215255, CurrSamplesPerSec=176.81794504612864, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:20,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=5160, skipped=98, lr=[7.097672699492267e-06, 7.097672699492267e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:20,497] [INFO] [timer.py:199:stop] epoch=5/micro_step=560/global_step=5160, RunningAvgSamplesPerSec=177.09342955428798, CurrSamplesPerSec=177.1874585803113, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:24,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=5170, skipped=98, lr=[7.088583716494987e-06, 7.088583716494987e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:24,117] [INFO] [timer.py:199:stop] epoch=5/micro_step=570/global_step=5170, RunningAvgSamplesPerSec=177.09328466390843, CurrSamplesPerSec=176.76066053924635, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:27,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=5180, skipped=98, lr=[7.07948442297883e-06, 7.07948442297883e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:27,746] [INFO] [timer.py:199:stop] epoch=5/micro_step=580/global_step=5180, RunningAvgSamplesPerSec=177.09226244702938, CurrSamplesPerSec=173.64201995843234, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:31,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=5190, skipped=98, lr=[7.07037486039066e-06, 7.07037486039066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:31,370] [INFO] [timer.py:199:stop] epoch=5/micro_step=590/global_step=5190, RunningAvgSamplesPerSec=177.09168166554377, CurrSamplesPerSec=176.84310607208502, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:34,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=5200, skipped=98, lr=[7.0612550702241075e-06, 7.0612550702241075e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:34,992] [INFO] [timer.py:199:stop] epoch=5/micro_step=600/global_step=5200, RunningAvgSamplesPerSec=177.0913671249418, CurrSamplesPerSec=176.94498818764765, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:35,325] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:19:35,659] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:19:38,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=5210, skipped=100, lr=[7.053951902147903e-06, 7.053951902147903e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:38,561] [INFO] [timer.py:199:stop] epoch=5/micro_step=610/global_step=5210, RunningAvgSamplesPerSec=177.09604171976756, CurrSamplesPerSec=176.29908565138638, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:42,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=5220, skipped=100, lr=[7.04481380705281e-06, 7.04481380705281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:42,182] [INFO] [timer.py:199:stop] epoch=5/micro_step=620/global_step=5220, RunningAvgSamplesPerSec=177.0957570479711, CurrSamplesPerSec=176.8651278446059, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:45,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=5230, skipped=100, lr=[7.03566560080875e-06, 7.03566560080875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:45,805] [INFO] [timer.py:199:stop] epoch=5/micro_step=630/global_step=5230, RunningAvgSamplesPerSec=177.0953103479093, CurrSamplesPerSec=176.7102762148708, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:49,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=5240, skipped=100, lr=[7.026507325085379e-06, 7.026507325085379e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:49,425] [INFO] [timer.py:199:stop] epoch=5/micro_step=640/global_step=5240, RunningAvgSamplesPerSec=177.0950952113209, CurrSamplesPerSec=176.92784419751345, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:19:53,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=5250, skipped=100, lr=[7.017339021598217e-06, 7.017339021598217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:53,048] [INFO] [timer.py:199:stop] epoch=5/micro_step=650/global_step=5250, RunningAvgSamplesPerSec=177.0946758334153, CurrSamplesPerSec=176.66433865581723, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:19:56,661] [INFO] [logging.py:96:log_dist] [Rank 0] step=5260, skipped=100, lr=[7.008160732108462e-06, 7.008160732108462e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:19:56,680] [INFO] [timer.py:199:stop] epoch=5/micro_step=660/global_step=5260, RunningAvgSamplesPerSec=177.09340026717842, CurrSamplesPerSec=176.89123482636438, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:00,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=5270, skipped=100, lr=[6.998972498422798e-06, 6.998972498422798e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:00,310] [INFO] [timer.py:199:stop] epoch=5/micro_step=670/global_step=5270, RunningAvgSamplesPerSec=177.0922845606944, CurrSamplesPerSec=177.0628701239936, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:03,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=5280, skipped=100, lr=[6.989774362393201e-06, 6.989774362393201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:03,934] [INFO] [timer.py:199:stop] epoch=5/micro_step=680/global_step=5280, RunningAvgSamplesPerSec=177.0917921105717, CurrSamplesPerSec=177.1948271753502, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:07,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=5290, skipped=100, lr=[6.980566365916755e-06, 6.980566365916755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:07,556] [INFO] [timer.py:199:stop] epoch=5/micro_step=690/global_step=5290, RunningAvgSamplesPerSec=177.09143695851773, CurrSamplesPerSec=176.99433943478746, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:11,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=5300, skipped=100, lr=[6.971348550935457e-06, 6.971348550935457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:11,177] [INFO] [timer.py:199:stop] epoch=5/micro_step=700/global_step=5300, RunningAvgSamplesPerSec=177.09123590065747, CurrSamplesPerSec=176.90044403791126, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:20:12,234] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:20:12,568] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:20:14,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=5310, skipped=102, lr=[6.963967257840505e-06, 6.963967257840505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:14,742] [INFO] [timer.py:199:stop] epoch=5/micro_step=710/global_step=5310, RunningAvgSamplesPerSec=177.0960845575907, CurrSamplesPerSec=176.9088380786178, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:18,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=5320, skipped=102, lr=[6.954731875386939e-06, 6.954731875386939e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:18,362] [INFO] [timer.py:199:stop] epoch=5/micro_step=720/global_step=5320, RunningAvgSamplesPerSec=177.0959236199345, CurrSamplesPerSec=177.2152987109356, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:21,964] [INFO] [logging.py:96:log_dist] [Rank 0] step=5330, skipped=102, lr=[6.94548679210343e-06, 6.94548679210343e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:21,983] [INFO] [timer.py:199:stop] epoch=5/micro_step=730/global_step=5330, RunningAvgSamplesPerSec=177.0957337268502, CurrSamplesPerSec=177.04313438397588, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:25,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=5340, skipped=102, lr=[6.9362320501009e-06, 6.9362320501009e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:25,606] [INFO] [timer.py:199:stop] epoch=5/micro_step=740/global_step=5340, RunningAvgSamplesPerSec=177.09524006869347, CurrSamplesPerSec=176.95711927602193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:29,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=5350, skipped=102, 
lr=[6.9269676915342725e-06, 6.9269676915342725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:29,228] [INFO] [timer.py:199:stop] epoch=5/micro_step=750/global_step=5350, RunningAvgSamplesPerSec=177.09491931697318, CurrSamplesPerSec=176.73331213340418, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:32,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=5360, skipped=102, lr=[6.917693758602269e-06, 6.917693758602269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:32,866] [INFO] [timer.py:199:stop] epoch=5/micro_step=760/global_step=5360, RunningAvgSamplesPerSec=177.09313079799435, CurrSamplesPerSec=176.88342523379265, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:36,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=5370, skipped=102, lr=[6.908410293547225e-06, 6.908410293547225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:36,484] [INFO] [timer.py:199:stop] epoch=5/micro_step=770/global_step=5370, RunningAvgSamplesPerSec=177.09310249237814, CurrSamplesPerSec=176.8813272518091, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:40,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=5380, skipped=102, lr=[6.899117338654896e-06, 6.899117338654896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:40,110] [INFO] [timer.py:199:stop] epoch=5/micro_step=780/global_step=5380, RunningAvgSamplesPerSec=177.09246508423428, CurrSamplesPerSec=177.03998174433897, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:43,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=5390, skipped=102, lr=[6.889814936254255e-06, 6.889814936254255e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:43,729] [INFO] [timer.py:199:stop] epoch=5/micro_step=790/global_step=5390, RunningAvgSamplesPerSec=177.0923539001516, CurrSamplesPerSec=176.9746189680657, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=5400, skipped=102, lr=[6.880503128717318e-06, 
6.880503128717318e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:47,349] [INFO] [timer.py:199:stop] epoch=5/micro_step=800/global_step=5400, RunningAvgSamplesPerSec=177.09216402140981, CurrSamplesPerSec=177.08716630668977, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:49,134] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:20:49,467] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:20:50,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=5410, skipped=104, lr=[6.87304693949098e-06, 6.87304693949098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:50,916] [INFO] [timer.py:199:stop] epoch=5/micro_step=810/global_step=5410, RunningAvgSamplesPerSec=177.0968557609736, CurrSamplesPerSec=176.80909379524206, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:54,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=5420, skipped=104, lr=[6.863718309622797e-06, 6.863718309622797e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:54,536] [INFO] [timer.py:199:stop] epoch=5/micro_step=820/global_step=5420, RunningAvgSamplesPerSec=177.09671133480657, CurrSamplesPerSec=176.9554861480086, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:20:58,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=5430, skipped=104, lr=[6.854380393487243e-06, 6.854380393487243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:20:58,156] [INFO] [timer.py:199:stop] epoch=5/micro_step=830/global_step=5430, RunningAvgSamplesPerSec=177.09649053269342, CurrSamplesPerSec=176.97590241503278, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:01,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=5440, skipped=104, lr=[6.845033233618091e-06, 6.845033233618091e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:01,780] 
[INFO] [timer.py:199:stop] epoch=5/micro_step=840/global_step=5440, RunningAvgSamplesPerSec=177.0959547110687, CurrSamplesPerSec=176.77975125026836, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:05,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=5450, skipped=104, lr=[6.83567687259122e-06, 6.83567687259122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:05,412] [INFO] [timer.py:199:stop] epoch=5/micro_step=850/global_step=5450, RunningAvgSamplesPerSec=177.09475643541003, CurrSamplesPerSec=177.06111825315736, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:09,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=5460, skipped=104, lr=[6.826311353024422e-06, 6.826311353024422e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:09,033] [INFO] [timer.py:199:stop] epoch=5/micro_step=860/global_step=5460, RunningAvgSamplesPerSec=177.09445962557538, CurrSamplesPerSec=176.77136945472841, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:12,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=5470, skipped=104, lr=[6.816936717577205e-06, 6.816936717577205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:12,654] [INFO] [timer.py:199:stop] epoch=5/micro_step=870/global_step=5470, RunningAvgSamplesPerSec=177.0942100547219, CurrSamplesPerSec=176.93799023937524, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:16,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=5480, skipped=104, lr=[6.807553008950597e-06, 6.807553008950597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:16,276] [INFO] [timer.py:199:stop] epoch=5/micro_step=880/global_step=5480, RunningAvgSamplesPerSec=177.09394768198334, CurrSamplesPerSec=177.0307579991308, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:19,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=5490, skipped=104, lr=[6.7981602698869555e-06, 6.7981602698869555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:19,897] [INFO] [timer.py:199:stop] 
epoch=5/micro_step=890/global_step=5490, RunningAvgSamplesPerSec=177.09367153430472, CurrSamplesPerSec=176.99200542246643, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:23,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=5500, skipped=104, lr=[6.788758543169772e-06, 6.788758543169772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:23,518] [INFO] [timer.py:199:stop] epoch=5/micro_step=900/global_step=5500, RunningAvgSamplesPerSec=177.09342752336178, CurrSamplesPerSec=176.99667350866702, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:26,025] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:21:26,358] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:21:27,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=5510, skipped=106, lr=[6.781230719462121e-06, 6.781230719462121e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:27,082] [INFO] [timer.py:199:stop] epoch=5/micro_step=910/global_step=5510, RunningAvgSamplesPerSec=177.0982756485225, CurrSamplesPerSec=177.10130321161537, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:30,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=5520, skipped=106, lr=[6.77181292291351e-06, 6.77181292291351e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:30,702] [INFO] [timer.py:199:stop] epoch=5/micro_step=920/global_step=5520, RunningAvgSamplesPerSec=177.09805929326262, CurrSamplesPerSec=176.93764035649147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 6/16 *****
ppl: 1.8256641626358032
Beginning of Epoch 7/16, Total Micro Batches 920
[2023-04-21 22:21:42,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=5530, skipped=106, lr=[6.762386258722304e-06, 6.762386258722304e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21
22:21:42,488] [INFO] [timer.py:199:stop] epoch=6/micro_step=10/global_step=5530, RunningAvgSamplesPerSec=177.0951064242282, CurrSamplesPerSec=177.1279474809781, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:46,090] [INFO] [logging.py:96:log_dist] [Rank 0] step=5540, skipped=106, lr=[6.752950769826516e-06, 6.752950769826516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:46,108] [INFO] [timer.py:199:stop] epoch=6/micro_step=20/global_step=5540, RunningAvgSamplesPerSec=177.09496879542294, CurrSamplesPerSec=177.0858812454275, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:49,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=5550, skipped=106, lr=[6.743506499204363e-06, 6.743506499204363e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:49,730] [INFO] [timer.py:199:stop] epoch=6/micro_step=30/global_step=5550, RunningAvgSamplesPerSec=177.09467894404204, CurrSamplesPerSec=176.8291268403544, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:53,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=5560, skipped=106, lr=[6.73405348987406e-06, 6.73405348987406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:53,352] [INFO] [timer.py:199:stop] epoch=6/micro_step=40/global_step=5560, RunningAvgSamplesPerSec=177.09430296300536, CurrSamplesPerSec=176.77404688634942, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:21:56,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=5570, skipped=106, lr=[6.724591784893625e-06, 6.724591784893625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:21:56,976] [INFO] [timer.py:199:stop] epoch=6/micro_step=50/global_step=5570, RunningAvgSamplesPerSec=177.09381395563082, CurrSamplesPerSec=176.87678179490408, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:00,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=5580, skipped=106, lr=[6.715121427360688e-06, 6.715121427360688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:00,596] [INFO] [timer.py:199:stop] 
epoch=6/micro_step=60/global_step=5580, RunningAvgSamplesPerSec=177.093687717729, CurrSamplesPerSec=176.9152507297449, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:04,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=5590, skipped=106, lr=[6.7056424604122874e-06, 6.7056424604122874e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:04,228] [INFO] [timer.py:199:stop] epoch=6/micro_step=70/global_step=5590, RunningAvgSamplesPerSec=177.092490933945, CurrSamplesPerSec=176.96528536827518, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:07,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=5600, skipped=106, lr=[6.696154927224676e-06, 6.696154927224676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:07,848] [INFO] [timer.py:199:stop] epoch=6/micro_step=80/global_step=5600, RunningAvgSamplesPerSec=177.09231736118332, CurrSamplesPerSec=176.94755424070917, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:11,103] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:22:11,437] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:22:11,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=5610, skipped=108, lr=[6.688558762021714e-06, 6.688558762021714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:11,437] [INFO] [timer.py:199:stop] epoch=6/micro_step=90/global_step=5610, RunningAvgSamplesPerSec=177.09482166944548, CurrSamplesPerSec=192.10735980305157, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:15,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=5620, skipped=108, lr=[6.679055918532112e-06, 6.679055918532112e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:15,060] [INFO] [timer.py:199:stop] epoch=6/micro_step=100/global_step=5620, RunningAvgSamplesPerSec=177.09448848407538, CurrSamplesPerSec=177.08634853827402, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:18,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=5630, skipped=108, lr=[6.669544629903765e-06, 6.669544629903765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:18,681] [INFO] [timer.py:199:stop] epoch=6/micro_step=110/global_step=5630, RunningAvgSamplesPerSec=177.09417871728468, CurrSamplesPerSec=177.08833456038548, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:22,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=5640, skipped=108, lr=[6.660024939460153e-06, 6.660024939460153e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:22,303] [INFO] [timer.py:199:stop] epoch=6/micro_step=120/global_step=5640, RunningAvgSamplesPerSec=177.09383304005152, CurrSamplesPerSec=177.01056252118374, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:25,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=5650, skipped=108, lr=[6.650496890563025e-06, 6.650496890563025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:25,928] [INFO] [timer.py:199:stop] epoch=6/micro_step=130/global_step=5650, RunningAvgSamplesPerSec=177.09317131124132, CurrSamplesPerSec=176.73191584616347, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:22:29,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=5660, skipped=108, lr=[6.640960526612202e-06, 6.640960526612202e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:29,551] [INFO] [timer.py:199:stop] epoch=6/micro_step=140/global_step=5660, RunningAvgSamplesPerSec=177.09272372278033, CurrSamplesPerSec=176.92446244213147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:33,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=5670, skipped=108, lr=[6.631415891045378e-06, 6.631415891045378e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:33,172] [INFO] [timer.py:199:stop] epoch=6/micro_step=150/global_step=5670, RunningAvgSamplesPerSec=177.09253267163177, CurrSamplesPerSec=176.98407024100015, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:36,773] [INFO] [logging.py:96:log_dist] [Rank 0] step=5680, skipped=108, lr=[6.621863027337929e-06, 6.621863027337929e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:36,791] [INFO] [timer.py:199:stop] epoch=6/micro_step=160/global_step=5680, RunningAvgSamplesPerSec=177.09238371366234, CurrSamplesPerSec=177.11672786415807, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:40,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=5690, skipped=108, lr=[6.612301979002704e-06, 6.612301979002704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:40,411] [INFO] [timer.py:199:stop] epoch=6/micro_step=170/global_step=5690, RunningAvgSamplesPerSec=177.09228658853692, CurrSamplesPerSec=176.8558057900108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:44,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=5700, skipped=108, lr=[6.602732789589832e-06, 6.602732789589832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:44,046] [INFO] [timer.py:199:stop] epoch=6/micro_step=180/global_step=5700, RunningAvgSamplesPerSec=177.0907553016326, CurrSamplesPerSec=176.7039947338106, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:22:47,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=5710, skipped=108, lr=[6.593155502686531e-06, 6.593155502686531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:47,670] [INFO] [timer.py:199:stop] epoch=6/micro_step=190/global_step=5710, RunningAvgSamplesPerSec=177.0902493590724, CurrSamplesPerSec=176.85883534996492, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:48,003] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:22:48,336] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:22:51,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=5720, skipped=110, lr=[6.58548787228494e-06, 6.58548787228494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:51,232] [INFO] [timer.py:199:stop] epoch=6/micro_step=200/global_step=5720, RunningAvgSamplesPerSec=177.09504026289906, CurrSamplesPerSec=177.1300513173792, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:54,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=5730, skipped=110, lr=[6.57589611985625e-06, 6.57589611985625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:54,852] [INFO] [timer.py:199:stop] epoch=6/micro_step=210/global_step=5730, RunningAvgSamplesPerSec=177.09486301232104, CurrSamplesPerSec=176.5357243618031, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:22:58,453] [INFO] [logging.py:96:log_dist] [Rank 0] step=5740, skipped=110, lr=[6.566296392176917e-06, 6.566296392176917e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:22:58,471] [INFO] [timer.py:199:stop] epoch=6/micro_step=220/global_step=5740, RunningAvgSamplesPerSec=177.09481502421727, CurrSamplesPerSec=177.05457824165285, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:02,074] [INFO] [logging.py:96:log_dist] [Rank 0] step=5750, 
skipped=110, lr=[6.556688732973254e-06, 6.556688732973254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:02,092] [INFO] [timer.py:199:stop] epoch=6/micro_step=230/global_step=5750, RunningAvgSamplesPerSec=177.09450861439242, CurrSamplesPerSec=176.75425908988848, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:05,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=5760, skipped=110, lr=[6.547073186007704e-06, 6.547073186007704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:05,710] [INFO] [timer.py:199:stop] epoch=6/micro_step=240/global_step=5760, RunningAvgSamplesPerSec=177.09448354587684, CurrSamplesPerSec=177.058315331924, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:09,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=5770, skipped=110, lr=[6.5374497950786375e-06, 6.5374497950786375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:09,345] [INFO] [timer.py:199:stop] epoch=6/micro_step=250/global_step=5770, RunningAvgSamplesPerSec=177.09324109040412, CurrSamplesPerSec=176.8604666944704, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:12,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=5780, skipped=110, lr=[6.527818604020154e-06, 6.527818604020154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:12,965] [INFO] [timer.py:199:stop] epoch=6/micro_step=260/global_step=5780, RunningAvgSamplesPerSec=177.0930338538132, CurrSamplesPerSec=177.00892840657616, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:16,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=5790, skipped=110, lr=[6.518179656701883e-06, 6.518179656701883e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:16,593] [INFO] [timer.py:199:stop] epoch=6/micro_step=270/global_step=5790, RunningAvgSamplesPerSec=177.0922541560707, CurrSamplesPerSec=177.07525099937993, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:20,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=5800, skipped=110, 
lr=[6.50853299702878e-06, 6.50853299702878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:20,212] [INFO] [timer.py:199:stop] epoch=6/micro_step=280/global_step=5800, RunningAvgSamplesPerSec=177.0921260709351, CurrSamplesPerSec=176.87713143714114, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:23,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=5810, skipped=110, lr=[6.498878668940935e-06, 6.498878668940935e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:23,832] [INFO] [timer.py:199:stop] epoch=6/micro_step=290/global_step=5810, RunningAvgSamplesPerSec=177.09199225158216, CurrSamplesPerSec=177.07992347780197, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:24,891] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:23:25,224] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:23:27,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=5820, skipped=112, lr=[6.4911497147620875e-06, 6.4911497147620875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:27,398] [INFO] [timer.py:199:stop] epoch=6/micro_step=300/global_step=5820, RunningAvgSamplesPerSec=177.0963726158762, CurrSamplesPerSec=176.90126009024462, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:31,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=5830, skipped=112, lr=[6.481481694368093e-06, 6.481481694368093e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:31,022] [INFO] [timer.py:199:stop] epoch=6/micro_step=310/global_step=5830, RunningAvgSamplesPerSec=177.09581750265068, CurrSamplesPerSec=176.81852739633538, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:34,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=5840, skipped=112, lr=[6.471806128776786e-06, 6.471806128776786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 22:23:34,650] [INFO] [timer.py:199:stop] epoch=6/micro_step=320/global_step=5840, RunningAvgSamplesPerSec=177.09500108540155, CurrSamplesPerSec=176.74995292119073, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:38,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=5850, skipped=112, lr=[6.462123062059916e-06, 6.462123062059916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:38,272] [INFO] [timer.py:199:stop] epoch=6/micro_step=330/global_step=5850, RunningAvgSamplesPerSec=177.09469324316223, CurrSamplesPerSec=177.25929927111412, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:41,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=5860, skipped=112, lr=[6.452432538323406e-06, 6.452432538323406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:41,891] [INFO] [timer.py:199:stop] epoch=6/micro_step=340/global_step=5860, RunningAvgSamplesPerSec=177.09460037658425, CurrSamplesPerSec=176.92982666594602, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:45,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=5870, skipped=112, lr=[6.442734601707142e-06, 6.442734601707142e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:45,510] [INFO] [timer.py:199:stop] epoch=6/micro_step=350/global_step=5870, RunningAvgSamplesPerSec=177.09446482014684, CurrSamplesPerSec=177.15495051007815, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:49,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=5880, skipped=112, lr=[6.433029296384776e-06, 6.433029296384776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:49,144] [INFO] [timer.py:199:stop] epoch=6/micro_step=360/global_step=5880, RunningAvgSamplesPerSec=177.09317786975228, CurrSamplesPerSec=176.97858604611912, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:52,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=5890, skipped=112, lr=[6.423316666563523e-06, 6.423316666563523e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:52,766] 
[INFO] [timer.py:199:stop] epoch=6/micro_step=370/global_step=5890, RunningAvgSamplesPerSec=177.09277559287312, CurrSamplesPerSec=176.83856257934121, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:56,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=5900, skipped=112, lr=[6.41359675648396e-06, 6.41359675648396e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:23:56,387] [INFO] [timer.py:199:stop] epoch=6/micro_step=380/global_step=5900, RunningAvgSamplesPerSec=177.0925614183506, CurrSamplesPerSec=177.12479182007976, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:23:59,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=5910, skipped=112, lr=[6.403869610419829e-06, 6.403869610419829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:00,007] [INFO] [timer.py:199:stop] epoch=6/micro_step=390/global_step=5910, RunningAvgSamplesPerSec=177.09241267360682, CurrSamplesPerSec=176.89531473849442, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:01,788] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:24:02,121] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:24:03,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=5920, skipped=114, lr=[6.396082713432634e-06, 6.396082713432634e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:03,570] [INFO] [timer.py:199:stop] epoch=6/micro_step=400/global_step=5920, RunningAvgSamplesPerSec=177.0969082137889, CurrSamplesPerSec=176.83448528462375, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:07,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=5930, skipped=114, lr=[6.386342654271181e-06, 6.386342654271181e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:07,191] [INFO] [timer.py:199:stop] epoch=6/micro_step=410/global_step=5930, RunningAvgSamplesPerSec=177.09666585039744, CurrSamplesPerSec=177.08611489154248, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:10,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=5940, skipped=114, lr=[6.376595483266332e-06, 6.376595483266332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:10,811] [INFO] [timer.py:199:stop] epoch=6/micro_step=420/global_step=5940, RunningAvgSamplesPerSec=177.09648494580892, CurrSamplesPerSec=176.9623688202375, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:14,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=5950, skipped=114, lr=[6.366841244815997e-06, 6.366841244815997e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:14,464] [INFO] [timer.py:199:stop] epoch=6/micro_step=430/global_step=5950, RunningAvgSamplesPerSec=177.0936135976408, CurrSamplesPerSec=176.89624731626373, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:18,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=5960, skipped=114, lr=[6.35707998335028e-06, 6.35707998335028e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:18,083] [INFO] [timer.py:199:stop] epoch=6/micro_step=440/global_step=5960, RunningAvgSamplesPerSec=177.093532083982, CurrSamplesPerSec=176.92609499755474, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:24:21,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=5970, skipped=114, lr=[6.347311743331277e-06, 6.347311743331277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:21,721] [INFO] [timer.py:199:stop] epoch=6/micro_step=450/global_step=5970, RunningAvgSamplesPerSec=177.0919001980823, CurrSamplesPerSec=176.94603792763178, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:25,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=5980, skipped=114, lr=[6.337536569252866e-06, 6.337536569252866e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:25,343] [INFO] [timer.py:199:stop] epoch=6/micro_step=460/global_step=5980, RunningAvgSamplesPerSec=177.09155937625732, CurrSamplesPerSec=176.99760715545577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:28,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=5990, skipped=114, lr=[6.327754505640514e-06, 6.327754505640514e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:28,964] [INFO] [timer.py:199:stop] epoch=6/micro_step=470/global_step=5990, RunningAvgSamplesPerSec=177.09135273750846, CurrSamplesPerSec=176.97671916370425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:32,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=6000, skipped=114, lr=[6.317965597051064e-06, 6.317965597051064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:32,583] [INFO] [timer.py:199:stop] epoch=6/micro_step=480/global_step=6000, RunningAvgSamplesPerSec=177.09124151048368, CurrSamplesPerSec=176.6621296031221, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:36,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=6010, skipped=114, lr=[6.308169888072543e-06, 6.308169888072543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:36,204] [INFO] [timer.py:199:stop] epoch=6/micro_step=490/global_step=6010, RunningAvgSamplesPerSec=177.09105211267055, CurrSamplesPerSec=176.7901131794637, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:24:38,709] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:24:39,042] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:24:39,749] [INFO] [logging.py:96:log_dist] [Rank 0] step=6020, skipped=116, lr=[6.3003284545925255e-06, 6.3003284545925255e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:39,767] [INFO] [timer.py:199:stop] epoch=6/micro_step=500/global_step=6020, RunningAvgSamplesPerSec=177.09550834095404, CurrSamplesPerSec=176.97578573726653, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:43,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=6030, skipped=116, lr=[6.290520617374243e-06, 6.290520617374243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:43,388] [INFO] [timer.py:199:stop] epoch=6/micro_step=510/global_step=6030, RunningAvgSamplesPerSec=177.0952899619236, CurrSamplesPerSec=176.8721200304939, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:46,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=6040, skipped=116, lr=[6.280706104777497e-06, 6.280706104777497e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:47,017] [INFO] [timer.py:199:stop] epoch=6/micro_step=520/global_step=6040, RunningAvgSamplesPerSec=177.09438573432465, CurrSamplesPerSec=176.89286676863242, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:50,618] [INFO] [logging.py:96:log_dist] [Rank 0] step=6050, skipped=116, lr=[6.2708849615069386e-06, 6.2708849615069386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:50,636] [INFO] [timer.py:199:stop] epoch=6/micro_step=530/global_step=6050, RunningAvgSamplesPerSec=177.09422694309248, CurrSamplesPerSec=176.9812697585949, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:54,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=6060, 
skipped=116, lr=[6.261057232297421e-06, 6.261057232297421e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:54,279] [INFO] [timer.py:199:stop] epoch=6/micro_step=540/global_step=6060, RunningAvgSamplesPerSec=177.09233290863827, CurrSamplesPerSec=176.96131888647756, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:24:57,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=6070, skipped=116, lr=[6.251222961913795e-06, 6.251222961913795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:24:57,899] [INFO] [timer.py:199:stop] epoch=6/micro_step=550/global_step=6070, RunningAvgSamplesPerSec=177.0922508448109, CurrSamplesPerSec=177.24349868372488, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:01,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=6080, skipped=116, lr=[6.241382195150706e-06, 6.241382195150706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:01,517] [INFO] [timer.py:199:stop] epoch=6/micro_step=560/global_step=6080, RunningAvgSamplesPerSec=177.09224961045322, CurrSamplesPerSec=177.06216937149912, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:05,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=6090, skipped=116, lr=[6.23153497683239e-06, 6.23153497683239e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:05,137] [INFO] [timer.py:199:stop] epoch=6/micro_step=570/global_step=6090, RunningAvgSamplesPerSec=177.09203966533988, CurrSamplesPerSec=177.13496046334075, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:08,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=6100, skipped=116, lr=[6.22168135181247e-06, 6.22168135181247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:08,755] [INFO] [timer.py:199:stop] epoch=6/micro_step=580/global_step=6100, RunningAvgSamplesPerSec=177.0920411570473, CurrSamplesPerSec=177.1478190400833, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:12,357] [INFO] [logging.py:96:log_dist] [Rank 0] step=6110, skipped=116, 
lr=[6.21182136497375e-06, 6.21182136497375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:12,375] [INFO] [timer.py:199:stop] epoch=6/micro_step=590/global_step=6110, RunningAvgSamplesPerSec=177.09191999814934, CurrSamplesPerSec=176.94545473721402, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:15,618] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:25:15,951] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:25:15,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=6120, skipped=118, lr=[6.2039288251729886e-06, 6.2039288251729886e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:15,952] [INFO] [timer.py:199:stop] epoch=6/micro_step=600/global_step=6120, RunningAvgSamplesPerSec=177.09566949493805, CurrSamplesPerSec=192.2815707701851, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:19,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=6130, skipped=118, lr=[6.194057500257468e-06, 6.194057500257468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:19,580] [INFO] [timer.py:199:stop] epoch=6/micro_step=610/global_step=6130, RunningAvgSamplesPerSec=177.09486518799707, CurrSamplesPerSec=177.218340599822, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:23,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=6140, skipped=118, lr=[6.18417993934851e-06, 6.18417993934851e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:23,199] [INFO] [timer.py:199:stop] epoch=6/micro_step=620/global_step=6140, RunningAvgSamplesPerSec=177.09479331345833, CurrSamplesPerSec=177.09032062704387, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:26,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=6150, skipped=118, lr=[6.1742961874379475e-06, 6.1742961874379475e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 22:25:26,840] [INFO] [timer.py:199:stop] epoch=6/micro_step=630/global_step=6150, RunningAvgSamplesPerSec=177.0929596319665, CurrSamplesPerSec=176.8722365716093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:30,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=6160, skipped=118, lr=[6.1644062895458145e-06, 6.1644062895458145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:30,461] [INFO] [timer.py:199:stop] epoch=6/micro_step=640/global_step=6160, RunningAvgSamplesPerSec=177.0927224790867, CurrSamplesPerSec=176.80839505001885, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:34,061] [INFO] [logging.py:96:log_dist] [Rank 0] step=6170, skipped=118, lr=[6.154510290720134e-06, 6.154510290720134e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:34,080] [INFO] [timer.py:199:stop] epoch=6/micro_step=650/global_step=6170, RunningAvgSamplesPerSec=177.09263308254015, CurrSamplesPerSec=176.9072058416481, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:37,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=6180, skipped=118, lr=[6.144608236036723e-06, 6.144608236036723e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:37,697] [INFO] [timer.py:199:stop] epoch=6/micro_step=660/global_step=6180, RunningAvgSamplesPerSec=177.09267546324503, CurrSamplesPerSec=176.99795727554098, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:41,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=6190, skipped=118, lr=[6.134700170598984e-06, 6.134700170598984e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:41,316] [INFO] [timer.py:199:stop] epoch=6/micro_step=670/global_step=6190, RunningAvgSamplesPerSec=177.0925638884969, CurrSamplesPerSec=177.15483359610468, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:44,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=6200, skipped=118, lr=[6.124786139537692e-06, 6.124786139537692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:44,939] [INFO] 
[timer.py:199:stop] epoch=6/micro_step=680/global_step=6200, RunningAvgSamplesPerSec=177.09218298906032, CurrSamplesPerSec=176.8742177940689, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:48,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=6210, skipped=118, lr=[6.114866188010802e-06, 6.114866188010802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:48,562] [INFO] [timer.py:199:stop] epoch=6/micro_step=690/global_step=6210, RunningAvgSamplesPerSec=177.09177563507373, CurrSamplesPerSec=176.98745427548903, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:52,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=6220, skipped=118, lr=[6.104940361203231e-06, 6.104940361203231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:52,181] [INFO] [timer.py:199:stop] epoch=6/micro_step=700/global_step=6220, RunningAvgSamplesPerSec=177.09172141834722, CurrSamplesPerSec=177.0021588246077, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:52,513] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:25:52,846] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:25:55,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=6230, skipped=120, lr=[6.096995499936438e-06, 6.096995499936438e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:55,744] [INFO] [timer.py:199:stop] epoch=6/micro_step=710/global_step=6230, RunningAvgSamplesPerSec=177.0959990658141, CurrSamplesPerSec=177.064505234718, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:25:59,359] [INFO] [logging.py:96:log_dist] [Rank 0] step=6240, skipped=120, lr=[6.0870592115749305e-06, 6.0870592115749305e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:25:59,378] [INFO] [timer.py:199:stop] epoch=6/micro_step=720/global_step=6240, RunningAvgSamplesPerSec=177.09505853649102, CurrSamplesPerSec=176.97333553971387, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:02,982] [INFO] [logging.py:96:log_dist] [Rank 0] step=6250, skipped=120, lr=[6.077117174592231e-06, 6.077117174592231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:03,000] [INFO] [timer.py:199:stop] epoch=6/micro_step=730/global_step=6250, RunningAvgSamplesPerSec=177.0947506570932, CurrSamplesPerSec=177.28353231067544, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:06,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=6260, skipped=120, lr=[6.067169434273856e-06, 6.067169434273856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:06,620] [INFO] [timer.py:199:stop] epoch=6/micro_step=740/global_step=6260, RunningAvgSamplesPerSec=177.09459061771483, CurrSamplesPerSec=177.0022755371514, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:10,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=6270, skipped=120, lr=[6.057216035931302e-06, 6.057216035931302e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:10,242] [INFO] [timer.py:199:stop] epoch=6/micro_step=750/global_step=6270, RunningAvgSamplesPerSec=177.09430536400635, CurrSamplesPerSec=177.01441448213967, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:26:13,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=6280, skipped=120, lr=[6.047257024901837e-06, 6.047257024901837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:13,865] [INFO] [timer.py:199:stop] epoch=6/micro_step=760/global_step=6280, RunningAvgSamplesPerSec=177.09390480051655, CurrSamplesPerSec=176.87876311919158, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:17,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=6290, skipped=120, lr=[6.037292446548297e-06, 6.037292446548297e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:17,484] [INFO] [timer.py:199:stop] epoch=6/micro_step=770/global_step=6290, RunningAvgSamplesPerSec=177.09379911936637, CurrSamplesPerSec=177.011729764376, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:21,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=6300, skipped=120, lr=[6.0273223462588705e-06, 6.0273223462588705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:21,105] [INFO] [timer.py:199:stop] epoch=6/micro_step=780/global_step=6300, RunningAvgSamplesPerSec=177.0935968829422, CurrSamplesPerSec=177.04231702217749, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:24,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=6310, skipped=120, lr=[6.0173467694469044e-06, 6.0173467694469044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:24,724] [INFO] [timer.py:199:stop] epoch=6/micro_step=790/global_step=6310, RunningAvgSamplesPerSec=177.09346000188293, CurrSamplesPerSec=176.91746611726683, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:28,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=6320, skipped=120, lr=[6.007365761550688e-06, 6.007365761550688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:28,349] [INFO] [timer.py:199:stop] epoch=6/micro_step=800/global_step=6320, RunningAvgSamplesPerSec=177.09294154655467, CurrSamplesPerSec=176.9332086263979, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:26:29,406] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:26:29,740] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:26:31,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=6330, skipped=122, lr=[5.999377075403383e-06, 5.999377075403383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:31,913] [INFO] [timer.py:199:stop] epoch=6/micro_step=810/global_step=6330, RunningAvgSamplesPerSec=177.0970926155036, CurrSamplesPerSec=177.13192143613634, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:35,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=6340, skipped=122, lr=[5.989386406138838e-06, 5.989386406138838e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:35,532] [INFO] [timer.py:199:stop] epoch=6/micro_step=820/global_step=6340, RunningAvgSamplesPerSec=177.09703377952818, CurrSamplesPerSec=177.03367680189092, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:39,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=6350, skipped=122, lr=[5.979390433148203e-06, 5.979390433148203e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:39,151] [INFO] [timer.py:199:stop] epoch=6/micro_step=830/global_step=6350, RunningAvgSamplesPerSec=177.09696675449914, CurrSamplesPerSec=177.10212111815585, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:42,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=6360, skipped=122, lr=[5.969389201962667e-06, 5.969389201962667e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:42,771] [INFO] [timer.py:199:stop] epoch=6/micro_step=840/global_step=6360, RunningAvgSamplesPerSec=177.09679394508206, CurrSamplesPerSec=177.21424577372784, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:46,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=6370, 
skipped=122, lr=[5.959382758137377e-06, 5.959382758137377e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:46,391] [INFO] [timer.py:199:stop] epoch=6/micro_step=850/global_step=6370, RunningAvgSamplesPerSec=177.09656830579343, CurrSamplesPerSec=176.87934587007675, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:49,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=6380, skipped=122, lr=[5.949371147251223e-06, 5.949371147251223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:50,012] [INFO] [timer.py:199:stop] epoch=6/micro_step=860/global_step=6380, RunningAvgSamplesPerSec=177.0963533033713, CurrSamplesPerSec=177.11579296024723, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:53,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=6390, skipped=122, lr=[5.939354414906624e-06, 5.939354414906624e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:53,631] [INFO] [timer.py:199:stop] epoch=6/micro_step=870/global_step=6390, RunningAvgSamplesPerSec=177.0962634929334, CurrSamplesPerSec=177.02865652073842, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:26:57,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=6400, skipped=122, lr=[5.9293326067293335e-06, 5.9293326067293335e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:26:57,253] [INFO] [timer.py:199:stop] epoch=6/micro_step=880/global_step=6400, RunningAvgSamplesPerSec=177.09595915465798, CurrSamplesPerSec=176.91595031981046, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:00,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=6410, skipped=122, lr=[5.919305768368224e-06, 5.919305768368224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:00,889] [INFO] [timer.py:199:stop] epoch=6/micro_step=890/global_step=6410, RunningAvgSamplesPerSec=177.0946062022223, CurrSamplesPerSec=176.31089672856436, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:04,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=6420, skipped=122, 
lr=[5.909273945495077e-06, 5.909273945495077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:04,509] [INFO] [timer.py:199:stop] epoch=6/micro_step=900/global_step=6420, RunningAvgSamplesPerSec=177.0944750081485, CurrSamplesPerSec=177.06614037564196, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:06,290] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:27:06,623] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:27:08,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=6430, skipped=124, lr=[5.901244929053832e-06, 5.901244929053832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:08,072] [INFO] [timer.py:199:stop] epoch=6/micro_step=910/global_step=6430, RunningAvgSamplesPerSec=177.0986558826617, CurrSamplesPerSec=177.09534440626797, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:11,674] [INFO] [logging.py:96:log_dist] [Rank 0] step=6440, skipped=124, lr=[5.8912042492242554e-06, 5.8912042492242554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:11,692] [INFO] [timer.py:199:stop] epoch=6/micro_step=920/global_step=6440, RunningAvgSamplesPerSec=177.0985141900556, CurrSamplesPerSec=177.07338207705237, MemAllocated=4.98GB, MaxMemAllocated=24.08GB ***** Evaluating perplexity, Epoch 7/16 ***** ppl: 1.8070838451385498 Beginning of Epoch 8/16, Total Micro Batches 920 [2023-04-21 22:27:23,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=6450, skipped=124, lr=[5.881158712883758e-06, 5.881158712883758e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:23,483] [INFO] [timer.py:199:stop] epoch=7/micro_step=10/global_step=6450, RunningAvgSamplesPerSec=177.09528192979772, CurrSamplesPerSec=176.72912333786948, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:27,093] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=6460, skipped=124, lr=[5.8711083657892926e-06, 5.8711083657892926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:27,112] [INFO] [timer.py:199:stop] epoch=7/micro_step=20/global_step=6460, RunningAvgSamplesPerSec=177.09445734462125, CurrSamplesPerSec=177.0253876535166, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:30,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=6470, skipped=124, lr=[5.861053253719727e-06, 5.861053253719727e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:30,737] [INFO] [timer.py:199:stop] epoch=7/micro_step=30/global_step=6470, RunningAvgSamplesPerSec=177.0939162370152, CurrSamplesPerSec=176.66398985435043, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:34,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=6480, skipped=124, lr=[5.850993422475626e-06, 5.850993422475626e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:34,370] [INFO] [timer.py:199:stop] epoch=7/micro_step=40/global_step=6480, RunningAvgSamplesPerSec=177.09289190705627, CurrSamplesPerSec=176.87748108076053, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:37,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=6490, skipped=124, lr=[5.840928917879057e-06, 5.840928917879057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:37,994] [INFO] [timer.py:199:stop] epoch=7/micro_step=50/global_step=6490, RunningAvgSamplesPerSec=177.09245218335576, CurrSamplesPerSec=177.15576891221323, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:41,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=6500, skipped=124, lr=[5.830859785773373e-06, 5.830859785773373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:41,615] [INFO] [timer.py:199:stop] epoch=7/micro_step=60/global_step=6500, RunningAvgSamplesPerSec=177.09221820678968, CurrSamplesPerSec=176.7345920827567, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:45,221] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=6510, skipped=124, lr=[5.8207860720230026e-06, 5.8207860720230026e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:45,239] [INFO] [timer.py:199:stop] epoch=7/micro_step=70/global_step=6510, RunningAvgSamplesPerSec=177.09173935029654, CurrSamplesPerSec=176.9414891443201, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:48,842] [INFO] [logging.py:96:log_dist] [Rank 0] step=6520, skipped=124, lr=[5.810707822513246e-06, 5.810707822513246e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:48,860] [INFO] [timer.py:199:stop] epoch=7/micro_step=80/global_step=6520, RunningAvgSamplesPerSec=177.0915225124211, CurrSamplesPerSec=177.04640390664213, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:51,367] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:27:51,700] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:27:52,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=6530, skipped=126, lr=[5.802641988006797e-06, 5.802641988006797e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:52,425] [INFO] [timer.py:199:stop] epoch=7/micro_step=90/global_step=6530, RunningAvgSamplesPerSec=177.09548739013422, CurrSamplesPerSec=176.9073224290042, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:56,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=6540, skipped=126, lr=[5.792555689826908e-06, 5.792555689826908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:56,045] [INFO] [timer.py:199:stop] epoch=7/micro_step=100/global_step=6540, RunningAvgSamplesPerSec=177.09532607144126, CurrSamplesPerSec=177.1516769771515, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:27:59,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=6550, skipped=126, lr=[5.782464984475714e-06, 5.782464984475714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:27:59,668] [INFO] [timer.py:199:stop] epoch=7/micro_step=110/global_step=6550, RunningAvgSamplesPerSec=177.09499725524802, CurrSamplesPerSec=175.65684693108997, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:03,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=6560, skipped=126, lr=[5.7723699179159095e-06, 5.7723699179159095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:03,290] [INFO] [timer.py:199:stop] epoch=7/micro_step=120/global_step=6560, RunningAvgSamplesPerSec=177.09466710922487, CurrSamplesPerSec=176.993989329015, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:06,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=6570, skipped=126, lr=[5.762270536130056e-06, 5.762270536130056e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:06,938] [INFO] [timer.py:199:stop] epoch=7/micro_step=130/global_step=6570, RunningAvgSamplesPerSec=177.09244888202616, CurrSamplesPerSec=174.1491763066437, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:28:10,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=6580, skipped=126, lr=[5.752166885120367e-06, 5.752166885120367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:10,560] [INFO] [timer.py:199:stop] epoch=7/micro_step=140/global_step=6580, RunningAvgSamplesPerSec=177.09219406299817, CurrSamplesPerSec=176.38585288445427, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:14,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=6590, skipped=126, lr=[5.742059010908505e-06, 5.742059010908505e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:14,180] [INFO] [timer.py:199:stop] epoch=7/micro_step=150/global_step=6590, RunningAvgSamplesPerSec=177.09198525746217, CurrSamplesPerSec=176.8468342405089, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:17,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=6600, skipped=126, lr=[5.73194695953537e-06, 5.73194695953537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:17,800] [INFO] [timer.py:199:stop] epoch=7/micro_step=160/global_step=6600, RunningAvgSamplesPerSec=177.09185742197639, CurrSamplesPerSec=177.07291485263463, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:21,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=6610, skipped=126, lr=[5.721830777060886e-06, 5.721830777060886e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:21,421] [INFO] [timer.py:199:stop] epoch=7/micro_step=170/global_step=6610, RunningAvgSamplesPerSec=177.09165706055904, CurrSamplesPerSec=177.04476913021426, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:25,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=6620, skipped=126, lr=[5.711710509563793e-06, 5.711710509563793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:25,040] [INFO] [timer.py:199:stop] epoch=7/micro_step=180/global_step=6620, RunningAvgSamplesPerSec=177.09155588178794, CurrSamplesPerSec=177.01207994033555, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:28:28,276] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:28:28,609] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:28:28,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=6630, skipped=128, lr=[5.703611385326642e-06, 5.703611385326642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:28,610] [INFO] [timer.py:199:stop] epoch=7/micro_step=190/global_step=6630, RunningAvgSamplesPerSec=177.09507942936023, CurrSamplesPerSec=192.09691168430425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:32,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=6640, skipped=128, lr=[5.693483880966548e-06, 5.693483880966548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:32,252] [INFO] [timer.py:199:stop] epoch=7/micro_step=200/global_step=6640, RunningAvgSamplesPerSec=177.0932809603344, CurrSamplesPerSec=176.80594948522938, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:35,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=6650, skipped=128, lr=[5.683352420702643e-06, 5.683352420702643e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:35,875] [INFO] [timer.py:199:stop] epoch=7/micro_step=210/global_step=6650, RunningAvgSamplesPerSec=177.0929627046, CurrSamplesPerSec=176.76764446738989, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:39,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=6660, skipped=128, lr=[5.673217050683262e-06, 5.673217050683262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:39,497] [INFO] [timer.py:199:stop] epoch=7/micro_step=220/global_step=6660, RunningAvgSamplesPerSec=177.09263475463027, CurrSamplesPerSec=176.57358047738296, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:43,102] [INFO] [logging.py:96:log_dist] [Rank 0] step=6670, 
skipped=128, lr=[5.663077817074542e-06, 5.663077817074542e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:43,121] [INFO] [timer.py:199:stop] epoch=7/micro_step=230/global_step=6670, RunningAvgSamplesPerSec=177.09221289473746, CurrSamplesPerSec=176.92352956684505, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:46,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=6680, skipped=128, lr=[5.652934766060224e-06, 5.652934766060224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:46,744] [INFO] [timer.py:199:stop] epoch=7/micro_step=240/global_step=6680, RunningAvgSamplesPerSec=177.09183418299187, CurrSamplesPerSec=176.80059277017972, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:50,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=6690, skipped=128, lr=[5.642787943841435e-06, 5.642787943841435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:50,365] [INFO] [timer.py:199:stop] epoch=7/micro_step=250/global_step=6690, RunningAvgSamplesPerSec=177.09161163967937, CurrSamplesPerSec=177.08331117885362, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:53,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=6700, skipped=128, lr=[5.632637396636479e-06, 5.632637396636479e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:53,986] [INFO] [timer.py:199:stop] epoch=7/micro_step=260/global_step=6700, RunningAvgSamplesPerSec=177.09140556437833, CurrSamplesPerSec=176.67282658272086, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:28:57,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=6710, skipped=128, lr=[5.622483170680628e-06, 5.622483170680628e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:28:57,610] [INFO] [timer.py:199:stop] epoch=7/micro_step=270/global_step=6710, RunningAvgSamplesPerSec=177.09094742566515, CurrSamplesPerSec=176.95956902456342, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:01,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=6720, skipped=128, 
lr=[5.612325312225912e-06, 5.612325312225912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:01,233] [INFO] [timer.py:199:stop] epoch=7/micro_step=280/global_step=6720, RunningAvgSamplesPerSec=177.0906508888975, CurrSamplesPerSec=176.9332086263979, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:04,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=6730, skipped=128, lr=[5.602163867540904e-06, 5.602163867540904e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:04,859] [INFO] [timer.py:199:stop] epoch=7/micro_step=290/global_step=6730, RunningAvgSamplesPerSec=177.09007898215447, CurrSamplesPerSec=175.88185236000436, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:05,192] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:29:05,525] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:29:08,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=6740, skipped=130, lr=[5.594032160810001e-06, 5.594032160810001e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:08,424] [INFO] [timer.py:199:stop] epoch=7/micro_step=300/global_step=6740, RunningAvgSamplesPerSec=177.09388345576045, CurrSamplesPerSec=176.899861147997, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:12,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=6750, skipped=130, lr=[5.5838643775592805e-06, 5.5838643775592805e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:12,045] [INFO] [timer.py:199:stop] epoch=7/micro_step=310/global_step=6750, RunningAvgSamplesPerSec=177.09364352234073, CurrSamplesPerSec=176.87410125034262, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:15,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=6760, skipped=130, lr=[5.5736931377165065e-06, 5.5736931377165065e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-21 22:29:15,684] [INFO] [timer.py:199:stop] epoch=7/micro_step=320/global_step=6760, RunningAvgSamplesPerSec=177.09217408857373, CurrSamplesPerSec=176.7191175744489, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:19,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=6770, skipped=130, lr=[5.563518487611204e-06, 5.563518487611204e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:19,303] [INFO] [timer.py:199:stop] epoch=7/micro_step=330/global_step=6770, RunningAvgSamplesPerSec=177.09210090258713, CurrSamplesPerSec=177.01278029641108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:22,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=6780, skipped=130, lr=[5.553340473588432e-06, 5.553340473588432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:22,922] [INFO] [timer.py:199:stop] epoch=7/micro_step=340/global_step=6780, RunningAvgSamplesPerSec=177.09200767937762, CurrSamplesPerSec=177.01219666596327, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:26,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=6790, skipped=130, lr=[5.543159142008574e-06, 5.543159142008574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:26,541] [INFO] [timer.py:199:stop] epoch=7/micro_step=350/global_step=6790, RunningAvgSamplesPerSec=177.09195364588066, CurrSamplesPerSec=176.88342523379265, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:30,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=6800, skipped=130, lr=[5.5329745392471205e-06, 5.5329745392471205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:30,161] [INFO] [timer.py:199:stop] epoch=7/micro_step=360/global_step=6800, RunningAvgSamplesPerSec=177.0917711166244, CurrSamplesPerSec=176.8201579973217, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:33,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=6810, skipped=130, lr=[5.522786711694468e-06, 5.522786711694468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 
22:29:33,790] [INFO] [timer.py:199:stop] epoch=7/micro_step=370/global_step=6810, RunningAvgSamplesPerSec=177.0909765900169, CurrSamplesPerSec=175.84682646263093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:37,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=6820, skipped=130, lr=[5.512595705755698e-06, 5.512595705755698e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:37,420] [INFO] [timer.py:199:stop] epoch=7/micro_step=380/global_step=6820, RunningAvgSamplesPerSec=177.09015792712248, CurrSamplesPerSec=177.13332405111905, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:41,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=6830, skipped=130, lr=[5.50240156785037e-06, 5.50240156785037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:41,041] [INFO] [timer.py:199:stop] epoch=7/micro_step=390/global_step=6830, RunningAvgSamplesPerSec=177.08995602555927, CurrSamplesPerSec=176.75204778786235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:42,099] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:29:42,433] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:29:44,600] [INFO] [logging.py:96:log_dist] [Rank 0] step=6840, skipped=132, lr=[5.49424403371324e-06, 5.49424403371324e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:44,618] [INFO] [timer.py:199:stop] epoch=7/micro_step=400/global_step=6840, RunningAvgSamplesPerSec=177.09287724559488, CurrSamplesPerSec=176.57880729008383, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:48,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=6850, skipped=132, lr=[5.4840443752907975e-06, 5.4840443752907975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:48,247] [INFO] [timer.py:199:stop] epoch=7/micro_step=410/global_step=6850, RunningAvgSamplesPerSec=177.09211502785206, CurrSamplesPerSec=176.75670322492283, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:51,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=6860, skipped=132, lr=[5.473841714951782e-06, 5.473841714951782e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:51,868] [INFO] [timer.py:199:stop] epoch=7/micro_step=420/global_step=6860, RunningAvgSamplesPerSec=177.0918733320249, CurrSamplesPerSec=177.11263773241313, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:55,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=6870, skipped=132, lr=[5.463636099168839e-06, 5.463636099168839e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:55,489] [INFO] [timer.py:199:stop] epoch=7/micro_step=430/global_step=6870, RunningAvgSamplesPerSec=177.09162871737624, CurrSamplesPerSec=176.88389145876988, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:29:59,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=6880, skipped=132, lr=[5.4534275744280765e-06, 5.4534275744280765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:29:59,110] [INFO] [timer.py:199:stop] epoch=7/micro_step=440/global_step=6880, RunningAvgSamplesPerSec=177.09146933740308, CurrSamplesPerSec=177.15319681667924, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:30:02,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=6890, skipped=132, lr=[5.44321618722885e-06, 5.44321618722885e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:02,731] [INFO] [timer.py:199:stop] epoch=7/micro_step=450/global_step=6890, RunningAvgSamplesPerSec=177.09126616823124, CurrSamplesPerSec=176.74937102259446, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:06,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=6900, skipped=132, lr=[5.433001984083553e-06, 5.433001984083553e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:06,351] [INFO] [timer.py:199:stop] epoch=7/micro_step=460/global_step=6900, RunningAvgSamplesPerSec=177.09110362708998, CurrSamplesPerSec=177.1680458516869, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:09,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=6910, skipped=132, lr=[5.42278501151741e-06, 5.42278501151741e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:09,975] [INFO] [timer.py:199:stop] epoch=7/micro_step=470/global_step=6910, RunningAvgSamplesPerSec=177.0907019110742, CurrSamplesPerSec=176.67119869791307, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:13,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=6920, skipped=132, lr=[5.412565316068258e-06, 5.412565316068258e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:13,597] [INFO] [timer.py:199:stop] epoch=7/micro_step=480/global_step=6920, RunningAvgSamplesPerSec=177.0903893091419, CurrSamplesPerSec=176.81130652488503, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:17,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=6930, skipped=132, lr=[5.402342944286334e-06, 5.402342944286334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:17,228] [INFO] [timer.py:199:stop] epoch=7/micro_step=490/global_step=6930, RunningAvgSamplesPerSec=177.08955552675488, CurrSamplesPerSec=176.6795709965215, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:30:19,012] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:30:19,345] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:30:20,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=6940, skipped=134, lr=[5.3941631511907465e-06, 5.3941631511907465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:20,794] [INFO] [timer.py:199:stop] epoch=7/micro_step=500/global_step=6940, RunningAvgSamplesPerSec=177.0931809403648, CurrSamplesPerSec=176.8256323777403, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:24,398] [INFO] [logging.py:96:log_dist] [Rank 0] step=6950, skipped=134, lr=[5.383936079355214e-06, 5.383936079355214e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:24,416] [INFO] [timer.py:199:stop] epoch=7/micro_step=510/global_step=6950, RunningAvgSamplesPerSec=177.09287161002814, CurrSamplesPerSec=176.9042912076868, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:28,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=6960, skipped=134, lr=[5.373706461591753e-06, 5.373706461591753e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:28,039] [INFO] [timer.py:199:stop] epoch=7/micro_step=520/global_step=6960, RunningAvgSamplesPerSec=177.09252606181954, CurrSamplesPerSec=176.99585657580673, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:31,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=6970, skipped=134, lr=[5.3634743444958e-06, 5.3634743444958e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:31,661] [INFO] [timer.py:199:stop] epoch=7/micro_step=530/global_step=6970, RunningAvgSamplesPerSec=177.0922391343544, CurrSamplesPerSec=176.89706332987802, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:35,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=6980, skipped=134, 
lr=[5.3532397746741776e-06, 5.3532397746741776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:35,281] [INFO] [timer.py:199:stop] epoch=7/micro_step=540/global_step=6980, RunningAvgSamplesPerSec=177.0921155187373, CurrSamplesPerSec=176.81771210712, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:38,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=6990, skipped=134, lr=[5.343002798744872e-06, 5.343002798744872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:38,904] [INFO] [timer.py:199:stop] epoch=7/micro_step=550/global_step=6990, RunningAvgSamplesPerSec=177.09178569656717, CurrSamplesPerSec=176.97158544014218, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:42,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=7000, skipped=134, lr=[5.332763463336836e-06, 5.332763463336836e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:42,528] [INFO] [timer.py:199:stop] epoch=7/micro_step=560/global_step=7000, RunningAvgSamplesPerSec=177.09134033864643, CurrSamplesPerSec=176.97858604611912, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:46,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=7010, skipped=134, lr=[5.322521815089769e-06, 5.322521815089769e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:46,151] [INFO] [timer.py:199:stop] epoch=7/micro_step=570/global_step=7010, RunningAvgSamplesPerSec=177.09093661416208, CurrSamplesPerSec=176.85918492125404, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:49,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=7020, skipped=134, lr=[5.312277900653901e-06, 5.312277900653901e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:49,784] [INFO] [timer.py:199:stop] epoch=7/micro_step=580/global_step=7020, RunningAvgSamplesPerSec=177.0902738322933, CurrSamplesPerSec=177.0167490855233, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:53,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=7030, skipped=134, lr=[5.30203176668979e-06, 
5.30203176668979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:53,405] [INFO] [timer.py:199:stop] epoch=7/micro_step=590/global_step=7030, RunningAvgSamplesPerSec=177.09009039422764, CurrSamplesPerSec=176.90452437486942, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:30:55,912] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:30:56,245] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:30:56,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=7040, skipped=136, lr=[5.293833292820517e-06, 5.293833292820517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:30:56,970] [INFO] [timer.py:199:stop] epoch=7/micro_step=600/global_step=7040, RunningAvgSamplesPerSec=177.0937438024618, CurrSamplesPerSec=176.9241126127462, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:00,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=7050, skipped=136, lr=[5.2835832813223e-06, 5.2835832813223e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:00,589] [INFO] [timer.py:199:stop] epoch=7/micro_step=610/global_step=7050, RunningAvgSamplesPerSec=177.09369922742357, CurrSamplesPerSec=177.24513713224346, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:04,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=7060, skipped=136, lr=[5.2733311809984985e-06, 5.2733311809984985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:04,209] [INFO] [timer.py:199:stop] epoch=7/micro_step=620/global_step=7060, RunningAvgSamplesPerSec=177.0935401141144, CurrSamplesPerSec=176.96761867592437, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:07,812] [INFO] [logging.py:96:log_dist] [Rank 0] step=7070, skipped=136, lr=[5.263077038546956e-06, 5.263077038546956e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:07,830] 
[INFO] [timer.py:199:stop] epoch=7/micro_step=630/global_step=7070, RunningAvgSamplesPerSec=177.09334361456553, CurrSamplesPerSec=176.7855724092875, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:11,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=7080, skipped=136, lr=[5.252820900674813e-06, 5.252820900674813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:11,450] [INFO] [timer.py:199:stop] epoch=7/micro_step=640/global_step=7080, RunningAvgSamplesPerSec=177.09319083500853, CurrSamplesPerSec=177.0134806580286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:15,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=7090, skipped=136, lr=[5.2425628140983045e-06, 5.2425628140983045e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:15,075] [INFO] [timer.py:199:stop] epoch=7/micro_step=650/global_step=7090, RunningAvgSamplesPerSec=177.09270066593012, CurrSamplesPerSec=176.85871882650898, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:18,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=7100, skipped=136, lr=[5.232302825542539e-06, 5.232302825542539e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:18,696] [INFO] [timer.py:199:stop] epoch=7/micro_step=660/global_step=7100, RunningAvgSamplesPerSec=177.09251492154846, CurrSamplesPerSec=177.12864875422554, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:22,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=7110, skipped=136, lr=[5.222040981741288e-06, 5.222040981741288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:22,319] [INFO] [timer.py:199:stop] epoch=7/micro_step=670/global_step=7110, RunningAvgSamplesPerSec=177.09216515345702, CurrSamplesPerSec=176.86011711811432, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:25,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=7120, skipped=136, lr=[5.211777329436774e-06, 5.211777329436774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:25,941] [INFO] [timer.py:199:stop] 
epoch=7/micro_step=680/global_step=7120, RunningAvgSamplesPerSec=177.09188184267464, CurrSamplesPerSec=176.7293560435234, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:29,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=7130, skipped=136, lr=[5.201511915379459e-06, 5.201511915379459e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:29,563] [INFO] [timer.py:199:stop] epoch=7/micro_step=690/global_step=7130, RunningAvgSamplesPerSec=177.09164764649722, CurrSamplesPerSec=176.95140345985268, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:32,791] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:31:33,124] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:31:33,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=7140, skipped=138, lr=[5.193298347093025e-06, 5.193298347093025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:33,125] [INFO] [timer.py:199:stop] epoch=7/micro_step=700/global_step=7140, RunningAvgSamplesPerSec=177.09544821272266, CurrSamplesPerSec=192.14063638537004, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:36,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=7150, skipped=138, lr=[5.1830298797173054e-06, 5.1830298797173054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:36,745] [INFO] [timer.py:199:stop] epoch=7/micro_step=710/global_step=7150, RunningAvgSamplesPerSec=177.09532435072705, CurrSamplesPerSec=176.98512044475154, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:40,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=7160, skipped=138, lr=[5.172759781532084e-06, 5.172759781532084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:40,364] [INFO] [timer.py:199:stop] epoch=7/micro_step=720/global_step=7160, 
RunningAvgSamplesPerSec=177.09520790255078, CurrSamplesPerSec=176.99959118758687, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=7170, skipped=138, lr=[5.16248809931718e-06, 5.16248809931718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:43,997] [INFO] [timer.py:199:stop] epoch=7/micro_step=730/global_step=7170, RunningAvgSamplesPerSec=177.09418548111609, CurrSamplesPerSec=176.4926624517405, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:47,621] [INFO] [logging.py:96:log_dist] [Rank 0] step=7180, skipped=138, lr=[5.1522148798596316e-06, 5.1522148798596316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:47,639] [INFO] [timer.py:199:stop] epoch=7/micro_step=740/global_step=7180, RunningAvgSamplesPerSec=177.09255192498298, CurrSamplesPerSec=177.09814850001055, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:51,242] [INFO] [logging.py:96:log_dist] [Rank 0] step=7190, skipped=138, lr=[5.141940169953478e-06, 5.141940169953478e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:51,260] [INFO] [timer.py:199:stop] epoch=7/micro_step=750/global_step=7190, RunningAvgSamplesPerSec=177.0923363686964, CurrSamplesPerSec=176.9528032174261, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:54,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=7200, skipped=138, lr=[5.1316640163995466e-06, 5.1316640163995466e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:54,897] [INFO] [timer.py:199:stop] epoch=7/micro_step=760/global_step=7200, RunningAvgSamplesPerSec=177.0910601993616, CurrSamplesPerSec=177.12046757850104, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:31:58,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=7210, skipped=138, lr=[5.121386466005237e-06, 5.121386466005237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:31:58,519] [INFO] [timer.py:199:stop] epoch=7/micro_step=770/global_step=7210, 
RunningAvgSamplesPerSec=177.0907857552004, CurrSamplesPerSec=176.9128022080863, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:02,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=7220, skipped=138, lr=[5.1111075655843175e-06, 5.1111075655843175e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:02,138] [INFO] [timer.py:199:stop] epoch=7/micro_step=780/global_step=7220, RunningAvgSamplesPerSec=177.09072412258448, CurrSamplesPerSec=176.9868708120343, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:05,739] [INFO] [logging.py:96:log_dist] [Rank 0] step=7230, skipped=138, lr=[5.100827361956704e-06, 5.100827361956704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:05,758] [INFO] [timer.py:199:stop] epoch=7/micro_step=790/global_step=7230, RunningAvgSamplesPerSec=177.09062118521942, CurrSamplesPerSec=177.02655509223752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:09,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=7240, skipped=138, lr=[5.090545901948244e-06, 5.090545901948244e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:09,382] [INFO] [timer.py:199:stop] epoch=7/micro_step=800/global_step=7240, RunningAvgSamplesPerSec=177.09024578112863, CurrSamplesPerSec=177.0255043966958, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:09,714] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:32:10,047] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:32:12,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=7250, skipped=140, lr=[5.0823198608179e-06, 5.0823198608179e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:12,945] [INFO] [timer.py:199:stop] epoch=7/micro_step=810/global_step=7250, RunningAvgSamplesPerSec=177.09392909417693, CurrSamplesPerSec=176.7608933279601, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:16,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=7260, skipped=140, lr=[5.072036257343196e-06, 5.072036257343196e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:16,566] [INFO] [timer.py:199:stop] epoch=7/micro_step=820/global_step=7260, RunningAvgSamplesPerSec=177.09375634135492, CurrSamplesPerSec=176.8981125013015, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:20,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=7270, skipped=140, lr=[5.061751528629793e-06, 5.061751528629793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:20,190] [INFO] [timer.py:199:stop] epoch=7/micro_step=830/global_step=7270, RunningAvgSamplesPerSec=177.093331542505, CurrSamplesPerSec=176.83553371387916, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:23,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=7280, skipped=140, lr=[5.0514657215241545e-06, 5.0514657215241545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:23,811] [INFO] [timer.py:199:stop] epoch=7/micro_step=840/global_step=7280, RunningAvgSamplesPerSec=177.09312576565821, CurrSamplesPerSec=177.1482866598694, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:27,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=7290, skipped=140, lr=[5.041178882877655e-06, 5.041178882877655e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:27,445] [INFO] [timer.py:199:stop] epoch=7/micro_step=850/global_step=7290, RunningAvgSamplesPerSec=177.09230579244235, CurrSamplesPerSec=176.93542446369415, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:32:31,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=7300, skipped=140, lr=[5.030891059546367e-06, 5.030891059546367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:31,064] [INFO] [timer.py:199:stop] epoch=7/micro_step=860/global_step=7300, RunningAvgSamplesPerSec=177.09223647616125, CurrSamplesPerSec=176.88832071863578, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:34,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=7310, skipped=140, lr=[5.0206022983908484e-06, 5.0206022983908484e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:34,684] [INFO] [timer.py:199:stop] epoch=7/micro_step=870/global_step=7310, RunningAvgSamplesPerSec=177.09207117337482, CurrSamplesPerSec=177.00776120032864, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:38,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=7320, skipped=140, lr=[5.0103126462759325e-06, 5.0103126462759325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:38,304] [INFO] [timer.py:199:stop] epoch=7/micro_step=880/global_step=7320, RunningAvgSamplesPerSec=177.09192329331498, CurrSamplesPerSec=176.97975286762951, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:41,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=7330, skipped=140, lr=[5.000022150070503e-06, 5.000022150070503e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:41,925] [INFO] [timer.py:199:stop] epoch=7/micro_step=890/global_step=7330, RunningAvgSamplesPerSec=177.09170966940832, CurrSamplesPerSec=177.08074118687796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:45,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=7340, skipped=140, lr=[4.989730856647296e-06, 4.989730856647296e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:45,547] [INFO] [timer.py:199:stop] epoch=7/micro_step=900/global_step=7340, RunningAvgSamplesPerSec=177.0914534656904, CurrSamplesPerSec=176.6993420727834, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:32:46,604] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:32:46,938] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:32:49,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=7350, skipped=142, lr=[4.98149727941273e-06, 4.98149727941273e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:49,128] [INFO] [timer.py:199:stop] epoch=7/micro_step=910/global_step=7350, RunningAvgSamplesPerSec=177.09409560567317, CurrSamplesPerSec=176.37194584703528, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:32:52,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=7360, skipped=142, lr=[4.971204669128264e-06, 4.971204669128264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:32:52,750] [INFO] [timer.py:199:stop] epoch=7/micro_step=920/global_step=7360, RunningAvgSamplesPerSec=177.09380804534197, CurrSamplesPerSec=176.94277211080532, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 8/16 *****
ppl: 1.7929623126983643
Beginning of Epoch 9/16, Total Micro Batches 920
[2023-04-21 22:33:04,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=7370, skipped=142, lr=[4.960911392888308e-06, 4.960911392888308e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:04,532] [INFO] [timer.py:199:stop] epoch=8/micro_step=10/global_step=7370, RunningAvgSamplesPerSec=177.09181350442555, CurrSamplesPerSec=176.74262127887079, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:08,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=7380, skipped=142, lr=[4.950617497578259e-06, 4.950617497578259e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:08,159] [INFO] [timer.py:199:stop] epoch=8/micro_step=20/global_step=7380, RunningAvgSamplesPerSec=177.09125141417846, CurrSamplesPerSec=176.73994479910667, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:11,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=7390, skipped=142, lr=[4.940323030086334e-06, 4.940323030086334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:11,787] [INFO] [timer.py:199:stop] epoch=8/micro_step=30/global_step=7390, RunningAvgSamplesPerSec=177.09060479274487, CurrSamplesPerSec=176.83937806084737, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:15,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=7400, skipped=142, lr=[4.930028037303352e-06, 4.930028037303352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:15,413] [INFO] [timer.py:199:stop] epoch=8/micro_step=40/global_step=7400, RunningAvgSamplesPerSec=177.09010228797192, CurrSamplesPerSec=176.62586253295672, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:19,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=7410, skipped=142, lr=[4.919732566122531e-06, 4.919732566122531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:19,038] [INFO] [timer.py:199:stop] epoch=8/micro_step=50/global_step=7410, RunningAvgSamplesPerSec=177.0896433918541, CurrSamplesPerSec=177.05714747426777, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:22,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=7420, skipped=142, lr=[4.909436663439265e-06, 4.909436663439265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:22,660] [INFO] [timer.py:199:stop] epoch=8/micro_step=60/global_step=7420, RunningAvgSamplesPerSec=177.08936774857574, CurrSamplesPerSec=176.9274943547547, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:26,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=7430, skipped=142, lr=[4.899140376150912e-06, 4.899140376150912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:26,283] [INFO] [timer.py:199:stop] epoch=8/micro_step=70/global_step=7430, RunningAvgSamplesPerSec=177.0890006983712, CurrSamplesPerSec=177.20207940693993, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:33:29,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=7440, skipped=142, lr=[4.888843751156581e-06, 4.888843751156581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:29,903] [INFO] [timer.py:199:stop] epoch=8/micro_step=80/global_step=7440, RunningAvgSamplesPerSec=177.08891014853438, CurrSamplesPerSec=177.0879840826583, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:31,683] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:33:32,026] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:33:33,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=7450, skipped=144, lr=[4.880606239530004e-06, 4.880606239530004e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:33,478] [INFO] [timer.py:199:stop] epoch=8/micro_step=90/global_step=7450, RunningAvgSamplesPerSec=177.09177977507355, CurrSamplesPerSec=177.0028591021786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:37,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=7460, skipped=144, lr=[4.8703091248554536e-06, 4.8703091248554536e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:37,101] [INFO] [timer.py:199:stop] epoch=8/micro_step=100/global_step=7460, RunningAvgSamplesPerSec=177.09142062255685, CurrSamplesPerSec=177.002392249849, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:40,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=7470, skipped=144, lr=[4.860011803799938e-06, 4.860011803799938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:40,721] [INFO] [timer.py:199:stop] epoch=8/micro_step=110/global_step=7470, RunningAvgSamplesPerSec=177.09129129038737, CurrSamplesPerSec=177.02772254635647, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:44,325] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=7480, skipped=144, lr=[4.84971432326728e-06, 4.84971432326728e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:44,344] [INFO] [timer.py:199:stop] epoch=8/micro_step=120/global_step=7480, RunningAvgSamplesPerSec=177.0909934495265, CurrSamplesPerSec=177.0862317148312, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:47,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=7490, skipped=144, lr=[4.839416730162025e-06, 4.839416730162025e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:47,965] [INFO] [timer.py:199:stop] epoch=8/micro_step=130/global_step=7490, RunningAvgSamplesPerSec=177.09078457324185, CurrSamplesPerSec=176.95711927602193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:51,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=7500, skipped=144, lr=[4.829119071389233e-06, 4.829119071389233e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:51,588] [INFO] [timer.py:199:stop] epoch=8/micro_step=140/global_step=7500, RunningAvgSamplesPerSec=177.090480740865, CurrSamplesPerSec=176.84531965312854, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:55,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=7510, skipped=144, lr=[4.818821393854262e-06, 4.818821393854262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:55,221] [INFO] [timer.py:199:stop] epoch=8/micro_step=150/global_step=7510, RunningAvgSamplesPerSec=177.08948893403564, CurrSamplesPerSec=176.7652000297643, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:33:58,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=7520, skipped=144, lr=[4.808523744462554e-06, 4.808523744462554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:33:58,844] [INFO] [timer.py:199:stop] epoch=8/micro_step=160/global_step=7520, RunningAvgSamplesPerSec=177.08922194934138, CurrSamplesPerSec=176.9775359199144, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:02,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=7530, skipped=144, 
lr=[4.798226170119427e-06, 4.798226170119427e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:02,464] [INFO] [timer.py:199:stop] epoch=8/micro_step=170/global_step=7530, RunningAvgSamplesPerSec=177.089036699446, CurrSamplesPerSec=177.01873354681803, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:06,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=7540, skipped=144, lr=[4.7879287177298555e-06, 4.7879287177298555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:06,091] [INFO] [timer.py:199:stop] epoch=8/micro_step=180/global_step=7540, RunningAvgSamplesPerSec=177.08850854440502, CurrSamplesPerSec=176.93204241853402, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:08,600] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:34:08,933] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:34:09,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=7550, skipped=146, lr=[4.779690875144548e-06, 4.779690875144548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:09,658] [INFO] [timer.py:199:stop] epoch=8/micro_step=190/global_step=7550, RunningAvgSamplesPerSec=177.09181197080653, CurrSamplesPerSec=176.84543615892906, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:13,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=7560, skipped=146, lr=[4.769393760469996e-06, 4.769393760469996e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:13,280] [INFO] [timer.py:199:stop] epoch=8/micro_step=200/global_step=7560, RunningAvgSamplesPerSec=177.091579458933, CurrSamplesPerSec=177.18523642589864, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:16,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=7570, skipped=146, lr=[4.759096899079287e-06, 4.759096899079287e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 22:34:16,899] [INFO] [timer.py:199:stop] epoch=8/micro_step=210/global_step=7570, RunningAvgSamplesPerSec=177.09146942491464, CurrSamplesPerSec=176.8687404049522, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:20,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=7580, skipped=146, lr=[4.748800337874146e-06, 4.748800337874146e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:20,523] [INFO] [timer.py:199:stop] epoch=8/micro_step=220/global_step=7580, RunningAvgSamplesPerSec=177.09109958137603, CurrSamplesPerSec=176.37495885237382, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:24,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=7590, skipped=146, lr=[4.738504123754934e-06, 4.738504123754934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:24,143] [INFO] [timer.py:199:stop] epoch=8/micro_step=230/global_step=7590, RunningAvgSamplesPerSec=177.09097468884076, CurrSamplesPerSec=176.83996055224225, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:27,745] [INFO] [logging.py:96:log_dist] [Rank 0] step=7600, skipped=146, lr=[4.728208303620428e-06, 4.728208303620428e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:27,764] [INFO] [timer.py:199:stop] epoch=8/micro_step=240/global_step=7600, RunningAvgSamplesPerSec=177.09078872751294, CurrSamplesPerSec=176.98768766194806, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:31,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=7610, skipped=146, lr=[4.717912924367608e-06, 4.717912924367608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:31,384] [INFO] [timer.py:199:stop] epoch=8/micro_step=250/global_step=7610, RunningAvgSamplesPerSec=177.09067905312475, CurrSamplesPerSec=176.81316991912064, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:34,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=7620, skipped=146, lr=[4.707618032891456e-06, 4.707618032891456e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:35,012] 
[INFO] [timer.py:199:stop] epoch=8/micro_step=260/global_step=7620, RunningAvgSamplesPerSec=177.0900177898697, CurrSamplesPerSec=177.18453470450592, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:38,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=7630, skipped=146, lr=[4.697323676084721e-06, 4.697323676084721e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:38,650] [INFO] [timer.py:199:stop] epoch=8/micro_step=270/global_step=7630, RunningAvgSamplesPerSec=177.0887999517402, CurrSamplesPerSec=176.96190218147274, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:42,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=7640, skipped=146, lr=[4.68702990083772e-06, 4.68702990083772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:42,269] [INFO] [timer.py:199:stop] epoch=8/micro_step=280/global_step=7640, RunningAvgSamplesPerSec=177.08873520857108, CurrSamplesPerSec=177.05014065184628, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:45,498] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:34:45,831] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:34:45,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=7650, skipped=148, lr=[4.678795330871738e-06, 4.678795330871738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:45,832] [INFO] [timer.py:199:stop] epoch=8/micro_step=290/global_step=7650, RunningAvgSamplesPerSec=177.09225649527164, CurrSamplesPerSec=191.91397898662214, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:49,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=7660, skipped=148, lr=[4.668502720587272e-06, 4.668502720587272e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:49,455] [INFO] [timer.py:199:stop] epoch=8/micro_step=300/global_step=7660, RunningAvgSamplesPerSec=177.09196497126817, CurrSamplesPerSec=176.77032178617173, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:53,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=7670, skipped=148, lr=[4.658210823140656e-06, 4.658210823140656e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:53,076] [INFO] [timer.py:199:stop] epoch=8/micro_step=310/global_step=7670, RunningAvgSamplesPerSec=177.09178748918978, CurrSamplesPerSec=176.88598950158215, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:34:56,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=7680, skipped=148, lr=[4.647919685411009e-06, 4.647919685411009e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:34:56,695] [INFO] [timer.py:199:stop] epoch=8/micro_step=320/global_step=7680, RunningAvgSamplesPerSec=177.0917050084677, CurrSamplesPerSec=176.9930557203932, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:00,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=7690, skipped=148, lr=[4.6376293542739845e-06, 4.6376293542739845e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:00,326] [INFO] [timer.py:199:stop] epoch=8/micro_step=330/global_step=7690, RunningAvgSamplesPerSec=177.0909400131148, CurrSamplesPerSec=176.78673668709587, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:35:03,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=7700, skipped=148, lr=[4.627339876601561e-06, 4.627339876601561e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:03,947] [INFO] [timer.py:199:stop] epoch=8/micro_step=340/global_step=7700, RunningAvgSamplesPerSec=177.0907121211169, CurrSamplesPerSec=176.94358855371019, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:07,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=7710, skipped=148, lr=[4.617051299261837e-06, 4.617051299261837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:07,567] [INFO] [timer.py:199:stop] epoch=8/micro_step=350/global_step=7710, RunningAvgSamplesPerSec=177.09060966805367, CurrSamplesPerSec=176.8468342405089, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:11,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=7720, skipped=148, lr=[4.606763669118804e-06, 4.606763669118804e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:11,201] [INFO] [timer.py:199:stop] epoch=8/micro_step=360/global_step=7720, RunningAvgSamplesPerSec=177.08961072638868, CurrSamplesPerSec=177.09067111401973, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:14,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=7730, skipped=148, lr=[4.596477033032136e-06, 4.596477033032136e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:14,820] [INFO] [timer.py:199:stop] epoch=8/micro_step=370/global_step=7730, RunningAvgSamplesPerSec=177.08952378228858, CurrSamplesPerSec=177.01289702296248, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:18,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=7740, skipped=148, lr=[4.58619143785699e-06, 4.58619143785699e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:18,440] [INFO] [timer.py:199:stop] epoch=8/micro_step=380/global_step=7740, RunningAvgSamplesPerSec=177.08941032460626, CurrSamplesPerSec=176.9948062446386, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:35:22,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=7750, skipped=148, lr=[4.5759069304437725e-06, 4.5759069304437725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:22,064] [INFO] [timer.py:199:stop] epoch=8/micro_step=390/global_step=7750, RunningAvgSamplesPerSec=177.08898312274675, CurrSamplesPerSec=176.98745427548903, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:22,397] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:35:22,731] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:35:25,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=7760, skipped=150, lr=[4.5676801391821015e-06, 4.5676801391821015e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:25,628] [INFO] [timer.py:199:stop] epoch=8/micro_step=400/global_step=7760, RunningAvgSamplesPerSec=177.09242918689958, CurrSamplesPerSec=177.25005265955514, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:29,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=7770, skipped=150, lr=[4.557397707787432e-06, 4.557397707787432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:29,248] [INFO] [timer.py:199:stop] epoch=8/micro_step=410/global_step=7770, RunningAvgSamplesPerSec=177.09226945388752, CurrSamplesPerSec=176.96015230802294, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:32,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=7780, skipped=150, lr=[4.547116495308796e-06, 4.547116495308796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:32,869] [INFO] [timer.py:199:stop] epoch=8/micro_step=420/global_step=7780, RunningAvgSamplesPerSec=177.0921332938385, CurrSamplesPerSec=177.1056265174707, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:36,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=7790, 
skipped=150, lr=[4.536836548576639e-06, 4.536836548576639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:36,489] [INFO] [timer.py:199:stop] epoch=8/micro_step=430/global_step=7790, RunningAvgSamplesPerSec=177.09199268321356, CurrSamplesPerSec=176.7911610826304, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:40,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=7800, skipped=150, lr=[4.526557914415644e-06, 4.526557914415644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:40,110] [INFO] [timer.py:199:stop] epoch=8/micro_step=440/global_step=7800, RunningAvgSamplesPerSec=177.0918078765498, CurrSamplesPerSec=176.83122358421957, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:43,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=7810, skipped=150, lr=[4.516280639644511e-06, 4.516280639644511e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:43,745] [INFO] [timer.py:199:stop] epoch=8/micro_step=450/global_step=7810, RunningAvgSamplesPerSec=177.09073964348298, CurrSamplesPerSec=177.14641619553748, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:47,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=7820, skipped=150, lr=[4.506004771075747e-06, 4.506004771075747e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:47,367] [INFO] [timer.py:199:stop] epoch=8/micro_step=460/global_step=7820, RunningAvgSamplesPerSec=177.09047083984754, CurrSamplesPerSec=176.99760715545577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:50,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=7830, skipped=150, lr=[4.495730355515464e-06, 4.495730355515464e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:50,987] [INFO] [timer.py:199:stop] epoch=8/micro_step=470/global_step=7830, RunningAvgSamplesPerSec=177.09035519212597, CurrSamplesPerSec=176.8203909427748, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:54,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=7840, skipped=150, 
lr=[4.485457439763144e-06, 4.485457439763144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:54,607] [INFO] [timer.py:199:stop] epoch=8/micro_step=480/global_step=7840, RunningAvgSamplesPerSec=177.0902588589633, CurrSamplesPerSec=176.86174848626604, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:58,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=7850, skipped=150, lr=[4.47518607061144e-06, 4.47518607061144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:35:58,232] [INFO] [timer.py:199:stop] epoch=8/micro_step=490/global_step=7850, RunningAvgSamplesPerSec=177.08981710961686, CurrSamplesPerSec=176.81899327926268, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:35:59,289] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:35:59,622] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:36:01,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=7860, skipped=152, lr=[4.466970120282696e-06, 4.466970120282696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:01,797] [INFO] [timer.py:199:stop] epoch=8/micro_step=500/global_step=7860, RunningAvgSamplesPerSec=177.09311413925425, CurrSamplesPerSec=176.84229055619568, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:05,409] [INFO] [logging.py:96:log_dist] [Rank 0] step=7870, skipped=152, lr=[4.4567016529069755e-06, 4.4567016529069755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:05,427] [INFO] [timer.py:199:stop] epoch=8/micro_step=510/global_step=7870, RunningAvgSamplesPerSec=177.09234152930054, CurrSamplesPerSec=176.88074448786807, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:09,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=7880, skipped=152, lr=[4.4464348631131495e-06, 4.4464348631131495e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-21 22:36:09,047] [INFO] [timer.py:199:stop] epoch=8/micro_step=520/global_step=7880, RunningAvgSamplesPerSec=177.09220501878082, CurrSamplesPerSec=177.00636057315054, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:12,658] [INFO] [logging.py:96:log_dist] [Rank 0] step=7890, skipped=152, lr=[4.436169797665969e-06, 4.436169797665969e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:12,676] [INFO] [timer.py:199:stop] epoch=8/micro_step=530/global_step=7890, RunningAvgSamplesPerSec=177.09154738196236, CurrSamplesPerSec=173.03858947053575, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:16,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=7900, skipped=152, lr=[4.425906503322332e-06, 4.425906503322332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:16,298] [INFO] [timer.py:199:stop] epoch=8/micro_step=540/global_step=7900, RunningAvgSamplesPerSec=177.0912685012182, CurrSamplesPerSec=177.0538775547894, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:19,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=7910, skipped=152, lr=[4.4156450268310666e-06, 4.4156450268310666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:19,922] [INFO] [timer.py:199:stop] epoch=8/micro_step=550/global_step=7910, RunningAvgSamplesPerSec=177.09091603024888, CurrSamplesPerSec=176.92166384578476, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:23,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=7920, skipped=152, lr=[4.405385414932725e-06, 4.405385414932725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:23,547] [INFO] [timer.py:199:stop] epoch=8/micro_step=560/global_step=7920, RunningAvgSamplesPerSec=177.09052166185973, CurrSamplesPerSec=176.74052663563802, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:27,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=7930, skipped=152, lr=[4.395127714359361e-06, 4.395127714359361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 
22:36:27,169] [INFO] [timer.py:199:stop] epoch=8/micro_step=570/global_step=7930, RunningAvgSamplesPerSec=177.09023962207365, CurrSamplesPerSec=177.01663235389154, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:30,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=7940, skipped=152, lr=[4.3848719718343285e-06, 4.3848719718343285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:30,790] [INFO] [timer.py:199:stop] epoch=8/micro_step=580/global_step=7940, RunningAvgSamplesPerSec=177.09006970606637, CurrSamplesPerSec=177.0131304765271, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:34,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=7950, skipped=152, lr=[4.374618234072057e-06, 4.374618234072057e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:34,410] [INFO] [timer.py:199:stop] epoch=8/micro_step=590/global_step=7950, RunningAvgSamplesPerSec=177.08995201748996, CurrSamplesPerSec=176.85196072876573, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:36,190] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:36:36,524] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:36:37,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=7960, skipped=154, lr=[4.366416718677702e-06, 4.366416718677702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:37,973] [INFO] [timer.py:199:stop] epoch=8/micro_step=600/global_step=7960, RunningAvgSamplesPerSec=177.09333288722567, CurrSamplesPerSec=176.78475742394616, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:41,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=7970, skipped=154, lr=[4.356166707179485e-06, 4.356166707179485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:41,593] [INFO] [timer.py:199:stop] epoch=8/micro_step=610/global_step=7970, RunningAvgSamplesPerSec=177.0932357197583, CurrSamplesPerSec=176.88097759298367, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:45,198] [INFO] [logging.py:96:log_dist] [Rank 0] step=7980, skipped=154, lr=[4.345918831195178e-06, 4.345918831195178e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:45,216] [INFO] [timer.py:199:stop] epoch=8/micro_step=620/global_step=7980, RunningAvgSamplesPerSec=177.09289764810228, CurrSamplesPerSec=175.8636463694588, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:48,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=7990, skipped=154, lr=[4.335673137403381e-06, 4.335673137403381e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:48,838] [INFO] [timer.py:199:stop] epoch=8/micro_step=630/global_step=7990, RunningAvgSamplesPerSec=177.09264053028988, CurrSamplesPerSec=177.0744333410073, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:52,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=8000, skipped=154, lr=[4.325429672472757e-06, 4.325429672472757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:52,458] [INFO] [timer.py:199:stop] epoch=8/micro_step=640/global_step=8000, RunningAvgSamplesPerSec=177.0925102073904, CurrSamplesPerSec=176.85813621153235, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:36:56,061] [INFO] [logging.py:96:log_dist] [Rank 0] step=8010, skipped=154, lr=[4.315188483061812e-06, 4.315188483061812e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:56,079] [INFO] [timer.py:199:stop] epoch=8/micro_step=650/global_step=8010, RunningAvgSamplesPerSec=177.09229648194997, CurrSamplesPerSec=176.8956644540056, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:36:59,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=8020, skipped=154, lr=[4.304949615818686e-06, 4.304949615818686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:36:59,705] [INFO] [timer.py:199:stop] epoch=8/micro_step=660/global_step=8020, RunningAvgSamplesPerSec=177.09183609175707, CurrSamplesPerSec=176.75065120456384, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:03,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=8030, skipped=154, lr=[4.2947131173809494e-06, 4.2947131173809494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:03,330] [INFO] [timer.py:199:stop] epoch=8/micro_step=670/global_step=8030, RunningAvgSamplesPerSec=177.09145284584383, CurrSamplesPerSec=176.82656422093288, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:06,951] [INFO] [logging.py:96:log_dist] [Rank 0] step=8040, skipped=154, lr=[4.284479034375376e-06, 4.284479034375376e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:06,973] [INFO] [timer.py:199:stop] epoch=8/micro_step=680/global_step=8040, RunningAvgSamplesPerSec=177.09015522395165, CurrSamplesPerSec=169.33020959099917, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:10,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=8050, skipped=154, lr=[4.274247413417738e-06, 4.274247413417738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:10,597] [INFO] [timer.py:199:stop] epoch=8/micro_step=690/global_step=8050, RunningAvgSamplesPerSec=177.08972963342637, CurrSamplesPerSec=175.91527822923112, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:37:13,104] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:37:13,438] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:37:14,145] [INFO] [logging.py:96:log_dist] [Rank 0] step=8060, skipped=156, lr=[4.266063920644788e-06, 4.266063920644788e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:14,163] [INFO] [timer.py:199:stop] epoch=8/micro_step=700/global_step=8060, RunningAvgSamplesPerSec=177.09291013712897, CurrSamplesPerSec=176.79209256254083, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:17,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=8070, skipped=156, lr=[4.255836848809254e-06, 4.255836848809254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:17,786] [INFO] [timer.py:199:stop] epoch=8/micro_step=710/global_step=8070, RunningAvgSamplesPerSec=177.09258836198848, CurrSamplesPerSec=176.83518423607936, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:21,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=8080, skipped=156, lr=[4.245612369485483e-06, 4.245612369485483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:21,414] [INFO] [timer.py:199:stop] epoch=8/micro_step=720/global_step=8080, RunningAvgSamplesPerSec=177.0920288194219, CurrSamplesPerSec=177.03391031026965, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:25,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=8090, skipped=156, lr=[4.2353905292455066e-06, 4.2353905292455066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:25,037] [INFO] [timer.py:199:stop] epoch=8/micro_step=730/global_step=8090, RunningAvgSamplesPerSec=177.0916727627146, CurrSamplesPerSec=176.80746339831225, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:28,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=8100, 
skipped=156, lr=[4.225171374649331e-06, 4.225171374649331e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:28,661] [INFO] [timer.py:199:stop] epoch=8/micro_step=740/global_step=8100, RunningAvgSamplesPerSec=177.0912976919333, CurrSamplesPerSec=177.21471374427546, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:32,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=8110, skipped=156, lr=[4.21495495224473e-06, 4.21495495224473e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:32,280] [INFO] [timer.py:199:stop] epoch=8/micro_step=750/global_step=8110, RunningAvgSamplesPerSec=177.09124997387713, CurrSamplesPerSec=177.0021588246077, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:35,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=8120, skipped=156, lr=[4.204741308567039e-06, 4.204741308567039e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:35,900] [INFO] [timer.py:199:stop] epoch=8/micro_step=760/global_step=8120, RunningAvgSamplesPerSec=177.09116638035755, CurrSamplesPerSec=177.06111825315736, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:39,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=8130, skipped=156, lr=[4.1945304901389275e-06, 4.1945304901389275e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:39,519] [INFO] [timer.py:199:stop] epoch=8/micro_step=770/global_step=8130, RunningAvgSamplesPerSec=177.09106890634425, CurrSamplesPerSec=177.13122013697483, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:43,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=8140, skipped=156, lr=[4.1843225434702e-06, 4.1843225434702e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:43,139] [INFO] [timer.py:199:stop] epoch=8/micro_step=780/global_step=8140, RunningAvgSamplesPerSec=177.09095754562833, CurrSamplesPerSec=177.09113843214558, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:46,741] [INFO] [logging.py:96:log_dist] [Rank 0] step=8150, skipped=156, 
lr=[4.174117515057583e-06, 4.174117515057583e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:46,759] [INFO] [timer.py:199:stop] epoch=8/micro_step=790/global_step=8150, RunningAvgSamplesPerSec=177.09079539929428, CurrSamplesPerSec=176.91455114521224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:49,988] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:37:50,322] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:37:50,323] [INFO] [logging.py:96:log_dist] [Rank 0] step=8160, skipped=158, lr=[4.165955624709205e-06, 4.165955624709205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:50,323] [INFO] [timer.py:199:stop] epoch=8/micro_step=800/global_step=8160, RunningAvgSamplesPerSec=177.094057851461, CurrSamplesPerSec=192.0371318135025, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:53,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=8170, skipped=158, lr=[4.155755966286761e-06, 4.155755966286761e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:53,968] [INFO] [timer.py:199:stop] epoch=8/micro_step=810/global_step=8170, RunningAvgSamplesPerSec=177.09255896737798, CurrSamplesPerSec=177.01278029641108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:37:57,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=8180, skipped=158, lr=[4.145559356239861e-06, 4.145559356239861e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:37:57,588] [INFO] [timer.py:199:stop] epoch=8/micro_step=820/global_step=8180, RunningAvgSamplesPerSec=177.09243648432337, CurrSamplesPerSec=176.86664277130026, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:01,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=8190, skipped=158, lr=[4.135365841013592e-06, 4.135365841013592e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 22:38:01,208] [INFO] [timer.py:199:stop] epoch=8/micro_step=830/global_step=8190, RunningAvgSamplesPerSec=177.0922739632589, CurrSamplesPerSec=177.0650892101526, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:04,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=8200, skipped=158, lr=[4.12517546703894e-06, 4.12517546703894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:04,832] [INFO] [timer.py:199:stop] epoch=8/micro_step=840/global_step=8200, RunningAvgSamplesPerSec=177.09193840236395, CurrSamplesPerSec=176.82108978281676, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:08,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=8210, skipped=158, lr=[4.114988280732588e-06, 4.114988280732588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:08,453] [INFO] [timer.py:199:stop] epoch=8/micro_step=850/global_step=8210, RunningAvgSamplesPerSec=177.0917334720898, CurrSamplesPerSec=177.11824710406367, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:12,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=8220, skipped=158, lr=[4.104804328496698e-06, 4.104804328496698e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:12,088] [INFO] [timer.py:199:stop] epoch=8/micro_step=860/global_step=8220, RunningAvgSamplesPerSec=177.09073457542627, CurrSamplesPerSec=177.0260881149014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:15,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=8230, skipped=158, lr=[4.0946236567186964e-06, 4.0946236567186964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:15,710] [INFO] [timer.py:199:stop] epoch=8/micro_step=870/global_step=8230, RunningAvgSamplesPerSec=177.09049462085045, CurrSamplesPerSec=176.98372017585268, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:19,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=8240, skipped=158, lr=[4.084446311771076e-06, 4.084446311771076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:19,331] [INFO] 
[timer.py:199:stop] epoch=8/micro_step=880/global_step=8240, RunningAvgSamplesPerSec=177.0903395737835, CurrSamplesPerSec=177.1263111983282, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:22,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=8250, skipped=158, lr=[4.074272340011168e-06, 4.074272340011168e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:22,956] [INFO] [timer.py:199:stop] epoch=8/micro_step=890/global_step=8250, RunningAvgSamplesPerSec=177.08987590548847, CurrSamplesPerSec=174.3749775076896, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:26,572] [INFO] [logging.py:96:log_dist] [Rank 0] step=8260, skipped=158, lr=[4.064101787780942e-06, 4.064101787780942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:26,590] [INFO] [timer.py:199:stop] epoch=8/micro_step=900/global_step=8260, RunningAvgSamplesPerSec=177.08929786491635, CurrSamplesPerSec=176.93519121504937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:26,923] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:38:27,256] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:38:30,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=8270, skipped=160, lr=[4.05596783919e-06, 4.05596783919e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:30,156] [INFO] [timer.py:199:stop] epoch=8/micro_step=910/global_step=8270, RunningAvgSamplesPerSec=177.09241456853778, CurrSamplesPerSec=176.98290336256068, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:33,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=8280, skipped=160, lr=[4.045803558845116e-06, 4.045803558845116e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:33,777] [INFO] [timer.py:199:stop] epoch=8/micro_step=920/global_step=8280, RunningAvgSamplesPerSec=177.09225073598273, CurrSamplesPerSec=177.00566026787362, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 9/16 *****
ppl: 1.7858991622924805
Beginning of Epoch 10/16, Total Micro Batches 920
[2023-04-21 22:38:45,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=8290, skipped=160, lr=[4.0356428277038916e-06, 4.0356428277038916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:45,549] [INFO] [timer.py:199:stop] epoch=9/micro_step=10/global_step=8290, RunningAvgSamplesPerSec=177.09108693850652, CurrSamplesPerSec=176.93764035649147, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:49,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=8300, skipped=160, lr=[4.0254856920479895e-06, 4.0254856920479895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:49,171] [INFO] [timer.py:199:stop] epoch=9/micro_step=20/global_step=8300, RunningAvgSamplesPerSec=177.09087229197422, CurrSamplesPerSec=177.03566164274517, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:52,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=8310, skipped=160, lr=[4.01533219814269e-06, 4.01533219814269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:52,792] [INFO] [timer.py:199:stop]
epoch=9/micro_step=30/global_step=8310, RunningAvgSamplesPerSec=177.09071571950713, CurrSamplesPerSec=177.12385683103665, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:38:56,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=8320, skipped=160, lr=[4.005182392236684e-06, 4.005182392236684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:38:56,412] [INFO] [timer.py:199:stop] epoch=9/micro_step=40/global_step=8320, RunningAvgSamplesPerSec=177.09057561973503, CurrSamplesPerSec=177.11146915803988, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:00,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=8330, skipped=160, lr=[3.995036320561872e-06, 3.995036320561872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:00,061] [INFO] [timer.py:199:stop] epoch=9/micro_step=50/global_step=8330, RunningAvgSamplesPerSec=177.08876947509756, CurrSamplesPerSec=176.72958874979014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:03,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=8340, skipped=160, lr=[3.9848940293331355e-06, 3.9848940293331355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:03,683] [INFO] [timer.py:199:stop] epoch=9/micro_step=60/global_step=8340, RunningAvgSamplesPerSec=177.088602316552, CurrSamplesPerSec=176.8711877070994, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:07,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=8350, skipped=160, lr=[3.974755564748145e-06, 3.974755564748145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:07,303] [INFO] [timer.py:199:stop] epoch=9/micro_step=70/global_step=8350, RunningAvgSamplesPerSec=177.08848143285303, CurrSamplesPerSec=177.1017705858563, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:10,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=8360, skipped=160, lr=[3.964620972987135e-06, 3.964620972987135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:10,924] [INFO] [timer.py:199:stop] 
epoch=9/micro_step=80/global_step=8360, RunningAvgSamplesPerSec=177.08836642969007, CurrSamplesPerSec=177.08412891921702, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:11,981] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:39:12,314] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:39:14,469] [INFO] [logging.py:96:log_dist] [Rank 0] step=8370, skipped=162, lr=[3.956516119033455e-06, 3.956516119033455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:14,487] [INFO] [timer.py:199:stop] epoch=9/micro_step=90/global_step=8370, RunningAvgSamplesPerSec=177.0915265068758, CurrSamplesPerSec=176.78673668709587, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:18,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=8380, skipped=162, lr=[3.946388614673359e-06, 3.946388614673359e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:18,111] [INFO] [timer.py:199:stop] epoch=9/micro_step=100/global_step=8380, RunningAvgSamplesPerSec=177.0912007237519, CurrSamplesPerSec=176.93134270119492, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:21,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=8390, skipped=162, lr=[3.936265112347387e-06, 3.936265112347387e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:21,732] [INFO] [timer.py:199:stop] epoch=9/micro_step=110/global_step=8390, RunningAvgSamplesPerSec=177.0910644037313, CurrSamplesPerSec=176.9697187054173, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:25,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=8400, skipped=162, lr=[3.926145658167621e-06, 3.926145658167621e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:25,352] [INFO] [timer.py:199:stop] epoch=9/micro_step=120/global_step=8400, RunningAvgSamplesPerSec=177.090921731004, 
CurrSamplesPerSec=177.00869496409518, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:28,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=8410, skipped=162, lr=[3.916030298227706e-06, 3.916030298227706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:28,972] [INFO] [timer.py:199:stop] epoch=9/micro_step=130/global_step=8410, RunningAvgSamplesPerSec=177.0908562235033, CurrSamplesPerSec=176.85871882650898, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:32,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=8420, skipped=162, lr=[3.905919078602639e-06, 3.905919078602639e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:32,611] [INFO] [timer.py:199:stop] epoch=9/micro_step=140/global_step=8420, RunningAvgSamplesPerSec=177.08961499413323, CurrSamplesPerSec=176.96808534483782, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:36,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=8430, skipped=162, lr=[3.89581204534855e-06, 3.89581204534855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:36,233] [INFO] [timer.py:199:stop] epoch=9/micro_step=150/global_step=8430, RunningAvgSamplesPerSec=177.08939857457403, CurrSamplesPerSec=177.02655509223752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:39,835] [INFO] [logging.py:96:log_dist] [Rank 0] step=8440, skipped=162, lr=[3.885709244502516e-06, 3.885709244502516e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:39,853] [INFO] [timer.py:199:stop] epoch=9/micro_step=160/global_step=8440, RunningAvgSamplesPerSec=177.08927634635086, CurrSamplesPerSec=177.14676690459095, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:43,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=8450, skipped=162, lr=[3.875610722082321e-06, 3.875610722082321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:43,472] [INFO] [timer.py:199:stop] epoch=9/micro_step=170/global_step=8450, RunningAvgSamplesPerSec=177.089216859357, 
CurrSamplesPerSec=177.0618189973319, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:47,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=8460, skipped=162, lr=[3.865516524086265e-06, 3.865516524086265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:47,091] [INFO] [timer.py:199:stop] epoch=9/micro_step=180/global_step=8460, RunningAvgSamplesPerSec=177.08912416938443, CurrSamplesPerSec=177.0967464420395, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:48,872] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:39:49,205] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:39:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=8470, skipped=164, lr=[3.8574443101730934e-06, 3.8574443101730934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:50,655] [INFO] [timer.py:199:stop] epoch=9/micro_step=190/global_step=8470, RunningAvgSamplesPerSec=177.0922887058228, CurrSamplesPerSec=176.75623767018112, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:54,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=8480, skipped=164, lr=[3.847358011993206e-06, 3.847358011993206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:54,275] [INFO] [timer.py:199:stop] epoch=9/micro_step=200/global_step=8480, RunningAvgSamplesPerSec=177.09214890991655, CurrSamplesPerSec=176.8293598094395, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:39:57,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=8490, skipped=164, lr=[3.837276166927244e-06, 3.837276166927244e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:39:57,896] [INFO] [timer.py:199:stop] epoch=9/micro_step=210/global_step=8490, RunningAvgSamplesPerSec=177.0920009235539, CurrSamplesPerSec=177.06800914510666, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:40:01,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=8500, skipped=164, lr=[3.827198820897545e-06, 3.827198820897545e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:01,515] [INFO] [timer.py:199:stop] epoch=9/micro_step=220/global_step=8500, RunningAvgSamplesPerSec=177.09193628263597, CurrSamplesPerSec=176.87841347050363, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:05,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=8510, skipped=164, lr=[3.817126019805953e-06, 3.817126019805953e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:05,143] [INFO] [timer.py:199:stop] epoch=9/micro_step=230/global_step=8510, RunningAvgSamplesPerSec=177.09135764628175, CurrSamplesPerSec=177.20746047851577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:08,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=8520, skipped=164, lr=[3.807057809533608e-06, 3.807057809533608e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:08,761] [INFO] [timer.py:199:stop] epoch=9/micro_step=240/global_step=8520, RunningAvgSamplesPerSec=177.09137766069898, CurrSamplesPerSec=177.33740723882357, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:12,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=8530, skipped=164, lr=[3.796994235940744e-06, 3.796994235940744e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:12,382] [INFO] [timer.py:199:stop] epoch=9/micro_step=250/global_step=8530, RunningAvgSamplesPerSec=177.09122913697266, CurrSamplesPerSec=177.1606794838204, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:15,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=8540, skipped=164, lr=[3.786935344866471e-06, 3.786935344866471e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:16,002] [INFO] [timer.py:199:stop] epoch=9/micro_step=260/global_step=8540, RunningAvgSamplesPerSec=177.09113668065584, CurrSamplesPerSec=176.95945236833293, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:19,605] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=8550, skipped=164, lr=[3.7768811821285694e-06, 3.7768811821285694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:19,623] [INFO] [timer.py:199:stop] epoch=9/micro_step=270/global_step=8550, RunningAvgSamplesPerSec=177.09095282761106, CurrSamplesPerSec=176.97625244925462, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:23,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=8560, skipped=164, lr=[3.7668317935232878e-06, 3.7668317935232878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:23,244] [INFO] [timer.py:199:stop] epoch=9/micro_step=280/global_step=8560, RunningAvgSamplesPerSec=177.09081523263538, CurrSamplesPerSec=176.87783072576224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:25,749] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:40:26,083] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:40:26,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=8570, skipped=166, lr=[3.7587957507757475e-06, 3.7587957507757475e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:26,811] [INFO] [timer.py:199:stop] epoch=9/micro_step=290/global_step=8570, RunningAvgSamplesPerSec=177.09371964836006, CurrSamplesPerSec=176.6394609391451, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:30,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=8580, skipped=166, lr=[3.7487550709461683e-06, 3.7487550709461683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:30,433] [INFO] [timer.py:199:stop] epoch=9/micro_step=300/global_step=8580, RunningAvgSamplesPerSec=177.09352834148373, CurrSamplesPerSec=176.86652623755626, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:34,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=8590, skipped=166, lr=[3.7387192933623415e-06, 3.7387192933623415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:34,055] [INFO] [timer.py:199:stop] epoch=9/micro_step=310/global_step=8590, RunningAvgSamplesPerSec=177.09329385192996, CurrSamplesPerSec=177.11824710406367, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:37,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=8600, skipped=166, lr=[3.7286884637367676e-06, 3.7286884637367676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:37,706] [INFO] [timer.py:199:stop] epoch=9/micro_step=320/global_step=8600, RunningAvgSamplesPerSec=177.0915863761139, CurrSamplesPerSec=176.94090598388627, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:41,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=8610, skipped=166, lr=[3.718662627759408e-06, 3.718662627759408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:41,328] [INFO] [timer.py:199:stop] epoch=9/micro_step=330/global_step=8610, RunningAvgSamplesPerSec=177.09137349007779, CurrSamplesPerSec=177.1146243442379, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:40:44,930] [INFO] [logging.py:96:log_dist] [Rank 0] step=8620, skipped=166, lr=[3.708641831097484e-06, 3.708641831097484e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:44,948] [INFO] [timer.py:199:stop] epoch=9/micro_step=340/global_step=8620, RunningAvgSamplesPerSec=177.09123713701183, CurrSamplesPerSec=176.71713982509684, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:48,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=8630, skipped=166, lr=[3.6986261193952582e-06, 3.6986261193952582e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:48,570] [INFO] [timer.py:199:stop] epoch=9/micro_step=350/global_step=8630, RunningAvgSamplesPerSec=177.09103990949362, CurrSamplesPerSec=176.46655697033006, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:52,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=8640, skipped=166, lr=[3.688615538273831e-06, 3.688615538273831e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:52,191] [INFO] [timer.py:199:stop] epoch=9/micro_step=360/global_step=8640, RunningAvgSamplesPerSec=177.09089775326467, CurrSamplesPerSec=176.92901035003456, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:55,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=8650, skipped=166, lr=[3.678610133330939e-06, 3.678610133330939e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:55,811] [INFO] [timer.py:199:stop] epoch=9/micro_step=370/global_step=8650, RunningAvgSamplesPerSec=177.09076673354886, CurrSamplesPerSec=176.88097759298367, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:40:59,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=8660, skipped=166, lr=[3.6686099501407364e-06, 3.6686099501407364e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:40:59,437] [INFO] [timer.py:199:stop] epoch=9/micro_step=380/global_step=8660, RunningAvgSamplesPerSec=177.09037553307007, CurrSamplesPerSec=177.05200908359984, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:41:02,667] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:41:03,000] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:41:03,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=8670, skipped=168, lr=[3.6606135938611617e-06, 3.6606135938611617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:03,001] [INFO] [timer.py:199:stop] epoch=9/micro_step=390/global_step=8670, RunningAvgSamplesPerSec=177.09341552460188, CurrSamplesPerSec=191.992492969311, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:06,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=8680, skipped=168, lr=[3.650622924596618e-06, 3.650622924596618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:06,623] [INFO] [timer.py:199:stop] epoch=9/micro_step=400/global_step=8680, RunningAvgSamplesPerSec=177.09322592515534, CurrSamplesPerSec=176.86501131285817, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:10,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=8690, skipped=168, lr=[3.6406376045652013e-06, 3.6406376045652013e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:10,264] [INFO] [timer.py:199:stop] epoch=9/micro_step=410/global_step=8690, RunningAvgSamplesPerSec=177.09190756526445, CurrSamplesPerSec=176.8003598768884, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:13,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=8700, skipped=168, lr=[3.630657679249581e-06, 3.630657679249581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:13,886] [INFO] [timer.py:199:stop] epoch=9/micro_step=420/global_step=8700, RunningAvgSamplesPerSec=177.09172195197226, CurrSamplesPerSec=176.89671360883577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:17,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=8710, 
skipped=168, lr=[3.6206831941078554e-06, 3.6206831941078554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:17,508] [INFO] [timer.py:199:stop] epoch=9/micro_step=430/global_step=8710, RunningAvgSamplesPerSec=177.09148673547992, CurrSamplesPerSec=176.7869695444978, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:21,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=8720, skipped=168, lr=[3.61071419457334e-06, 3.61071419457334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:21,131] [INFO] [timer.py:199:stop] epoch=9/micro_step=440/global_step=8720, RunningAvgSamplesPerSec=177.09126830124484, CurrSamplesPerSec=176.87316890606138, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:24,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=8730, skipped=168, lr=[3.600750726054367e-06, 3.600750726054367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:24,759] [INFO] [timer.py:199:stop] epoch=9/micro_step=450/global_step=8730, RunningAvgSamplesPerSec=177.0907318443551, CurrSamplesPerSec=176.85475712036262, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:28,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=8740, skipped=168, lr=[3.590792833934074e-06, 3.590792833934074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:28,383] [INFO] [timer.py:199:stop] epoch=9/micro_step=460/global_step=8740, RunningAvgSamplesPerSec=177.09041280121366, CurrSamplesPerSec=177.0306412490223, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:31,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=8750, skipped=168, lr=[3.580840563570196e-06, 3.580840563570196e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:32,003] [INFO] [timer.py:199:stop] epoch=9/micro_step=470/global_step=8750, RunningAvgSamplesPerSec=177.09030222397854, CurrSamplesPerSec=177.08634853827402, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:35,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=8760, skipped=168, 
lr=[3.570893960294865e-06, 3.570893960294865e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:35,626] [INFO] [timer.py:199:stop] epoch=9/micro_step=480/global_step=8760, RunningAvgSamplesPerSec=177.09005696901488, CurrSamplesPerSec=177.0092785714521, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:39,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=8770, skipped=168, lr=[3.5609530694143975e-06, 3.5609530694143975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:39,249] [INFO] [timer.py:199:stop] epoch=9/micro_step=490/global_step=8770, RunningAvgSamplesPerSec=177.08978197454437, CurrSamplesPerSec=176.79721587750703, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:39,583] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:41:39,917] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:41:42,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=8780, skipped=170, lr=[3.553004500063564e-06, 3.553004500063564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:42,817] [INFO] [timer.py:199:stop] epoch=9/micro_step=500/global_step=8780, RunningAvgSamplesPerSec=177.0925817941946, CurrSamplesPerSec=176.79372267600309, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:46,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=8790, skipped=170, lr=[3.543074005582579e-06, 3.543074005582579e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:46,440] [INFO] [timer.py:199:stop] epoch=9/micro_step=510/global_step=8790, RunningAvgSamplesPerSec=177.09233250284447, CurrSamplesPerSec=176.776957230105, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:50,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=8800, skipped=170, lr=[3.533149350215063e-06, 3.533149350215063e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 22:41:50,065] [INFO] [timer.py:199:stop] epoch=9/micro_step=520/global_step=8800, RunningAvgSamplesPerSec=177.0919484631553, CurrSamplesPerSec=176.78673668709587, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:53,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=8810, skipped=170, lr=[3.5232305791673577e-06, 3.5232305791673577e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:53,686] [INFO] [timer.py:199:stop] epoch=9/micro_step=530/global_step=8810, RunningAvgSamplesPerSec=177.09178118429398, CurrSamplesPerSec=177.1129883077321, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:41:57,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=8820, skipped=170, lr=[3.5133177376190076e-06, 3.5133177376190076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:41:57,308] [INFO] [timer.py:199:stop] epoch=9/micro_step=540/global_step=8820, RunningAvgSamplesPerSec=177.09155871848859, CurrSamplesPerSec=176.74681071428807, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:00,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=8830, skipped=170, lr=[3.5034108707225454e-06, 3.5034108707225454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:00,929] [INFO] [timer.py:199:stop] epoch=9/micro_step=550/global_step=8830, RunningAvgSamplesPerSec=177.09144139885348, CurrSamplesPerSec=176.9806863359156, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:04,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=8840, skipped=170, lr=[3.4935100236032875e-06, 3.4935100236032875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:04,548] [INFO] [timer.py:199:stop] epoch=9/micro_step=560/global_step=8840, RunningAvgSamplesPerSec=177.0913637905573, CurrSamplesPerSec=177.08331117885362, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:08,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=8850, skipped=170, lr=[3.483615241359139e-06, 3.483615241359139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:08,169] 
[INFO] [timer.py:199:stop] epoch=9/micro_step=570/global_step=8850, RunningAvgSamplesPerSec=177.09122160478415, CurrSamplesPerSec=176.63237091056962, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:11,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=8860, skipped=170, lr=[3.4737265690603706e-06, 3.4737265690603706e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:11,792] [INFO] [timer.py:199:stop] epoch=9/micro_step=580/global_step=8860, RunningAvgSamplesPerSec=177.0909622775071, CurrSamplesPerSec=176.99772386199695, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:15,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=8870, skipped=170, lr=[3.463844051749425e-06, 3.463844051749425e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:15,432] [INFO] [timer.py:199:stop] epoch=9/micro_step=590/global_step=8870, RunningAvgSamplesPerSec=177.0897746110439, CurrSamplesPerSec=177.01383084091566, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:16,489] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:42:16,823] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:42:18,977] [INFO] [logging.py:96:log_dist] [Rank 0] step=8880, skipped=172, lr=[3.455942499742533e-06, 3.455942499742533e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:18,996] [INFO] [timer.py:199:stop] epoch=9/micro_step=600/global_step=8880, RunningAvgSamplesPerSec=177.09277210421067, CurrSamplesPerSec=176.90953761796806, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:22,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=8890, skipped=172, lr=[3.4460711748270122e-06, 3.4460711748270122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:22,617] [INFO] [timer.py:199:stop] epoch=9/micro_step=610/global_step=8890, RunningAvgSamplesPerSec=177.0926097537333, CurrSamplesPerSec=177.1293500330259, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:26,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=8900, skipped=172, lr=[3.4362061308683534e-06, 3.4362061308683534e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:26,240] [INFO] [timer.py:199:stop] epoch=9/micro_step=620/global_step=8900, RunningAvgSamplesPerSec=177.09237657713157, CurrSamplesPerSec=176.88284245602728, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:29,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=8910, skipped=172, lr=[3.4263474128013763e-06, 3.4263474128013763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:29,869] [INFO] [timer.py:199:stop] epoch=9/micro_step=630/global_step=8910, RunningAvgSamplesPerSec=177.09195355616814, CurrSamplesPerSec=176.2154112976971, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:33,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=8920, skipped=172, lr=[3.416495065532083e-06, 3.416495065532083e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:33,489] [INFO] [timer.py:199:stop] epoch=9/micro_step=640/global_step=8920, RunningAvgSamplesPerSec=177.09189772455593, CurrSamplesPerSec=177.1978683614861, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:42:37,089] [INFO] [logging.py:96:log_dist] [Rank 0] step=8930, skipped=172, lr=[3.406649133937459e-06, 3.406649133937459e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:37,108] [INFO] [timer.py:199:stop] epoch=9/micro_step=650/global_step=8930, RunningAvgSamplesPerSec=177.0918270870578, CurrSamplesPerSec=177.09160575273782, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:40,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=8940, skipped=172, lr=[3.396809662865268e-06, 3.396809662865268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:40,728] [INFO] [timer.py:199:stop] epoch=9/micro_step=660/global_step=8940, RunningAvgSamplesPerSec=177.09173432123225, CurrSamplesPerSec=176.9972570367557, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=8950, skipped=172, lr=[3.386976697133843e-06, 3.386976697133843e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:44,348] [INFO] [timer.py:199:stop] epoch=9/micro_step=670/global_step=8950, RunningAvgSamplesPerSec=177.09163613545113, CurrSamplesPerSec=176.81724623094394, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:47,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=8960, skipped=172, lr=[3.377150281531885e-06, 3.377150281531885e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:47,999] [INFO] [timer.py:199:stop] epoch=9/micro_step=680/global_step=8960, RunningAvgSamplesPerSec=177.08984440570808, CurrSamplesPerSec=177.1330902804127, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:51,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=8970, skipped=172, lr=[3.367330460818266e-06, 3.367330460818266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:51,620] [INFO] [timer.py:199:stop] epoch=9/micro_step=690/global_step=8970, RunningAvgSamplesPerSec=177.0897014159335, CurrSamplesPerSec=177.02246912407873, MemAllocated=4.98GB, MaxMemAllocated=24.08GB 
[2023-04-21 22:42:53,400] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:42:53,734] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:42:55,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=8980, skipped=174, lr=[3.359479382625759e-06, 3.359479382625759e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:55,183] [INFO] [timer.py:199:stop] epoch=9/micro_step=700/global_step=8980, RunningAvgSamplesPerSec=177.09269849771985, CurrSamplesPerSec=176.82528293907512, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:42:58,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=8990, skipped=174, lr=[3.349671545407474e-06, 3.349671545407474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:42:58,805] [INFO] [timer.py:199:stop] epoch=9/micro_step=710/global_step=8990, RunningAvgSamplesPerSec=177.0925133771732, CurrSamplesPerSec=176.88913665912153, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:02,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=9000, skipped=174, lr=[3.3398704282418955e-06, 3.3398704282418955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:02,429] [INFO] [timer.py:199:stop] epoch=9/micro_step=720/global_step=9000, RunningAvgSamplesPerSec=177.09216879474909, CurrSamplesPerSec=176.78464099808224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:06,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=9010, skipped=174, lr=[3.3300760757726578e-06, 3.3300760757726578e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:06,053] [INFO] [timer.py:199:stop] epoch=9/micro_step=730/global_step=9010, RunningAvgSamplesPerSec=177.09188917858697, CurrSamplesPerSec=176.74192305894329, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:09,656] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=9020, skipped=174, lr=[3.32028853261258e-06, 3.32028853261258e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:09,675] [INFO] [timer.py:199:stop] epoch=9/micro_step=740/global_step=9020, RunningAvgSamplesPerSec=177.09167383260967, CurrSamplesPerSec=176.74878912782964, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:13,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=9030, skipped=174, lr=[3.3105078433434694e-06, 3.3105078433434694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:13,297] [INFO] [timer.py:199:stop] epoch=9/micro_step=750/global_step=9030, RunningAvgSamplesPerSec=177.09146746637518, CurrSamplesPerSec=176.9510535239195, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:16,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=9040, skipped=174, lr=[3.300734052515911e-06, 3.300734052515911e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:16,918] [INFO] [timer.py:199:stop] epoch=9/micro_step=760/global_step=9040, RunningAvgSamplesPerSec=177.091321239105, CurrSamplesPerSec=176.91070352917916, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:20,526] [INFO] [logging.py:96:log_dist] [Rank 0] step=9050, skipped=174, lr=[3.2909672046490673e-06, 3.2909672046490673e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:20,545] [INFO] [timer.py:199:stop] epoch=9/micro_step=770/global_step=9050, RunningAvgSamplesPerSec=177.0908656746595, CurrSamplesPerSec=176.78464099808224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:24,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=9060, skipped=174, lr=[3.2812073442304823e-06, 3.2812073442304823e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:24,170] [INFO] [timer.py:199:stop] epoch=9/micro_step=780/global_step=9060, RunningAvgSamplesPerSec=177.09050540189182, CurrSamplesPerSec=176.99550646403227, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:27,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=9070, skipped=174, 
lr=[3.271454515715864e-06, 3.271454515715864e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:27,789] [INFO] [timer.py:199:stop] epoch=9/micro_step=790/global_step=9070, RunningAvgSamplesPerSec=177.09047447060846, CurrSamplesPerSec=177.0942928940075, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:30,294] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:43:30,628] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:43:31,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=9080, skipped=176, lr=[3.2636573457288193e-06, 3.2636573457288193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:31,354] [INFO] [timer.py:199:stop] epoch=9/micro_step=800/global_step=9080, RunningAvgSamplesPerSec=177.09335520250363, CurrSamplesPerSec=176.7812647147342, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:34,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=9090, skipped=176, lr=[3.253917286567367e-06, 3.253917286567367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:34,986] [INFO] [timer.py:199:stop] epoch=9/micro_step=810/global_step=9090, RunningAvgSamplesPerSec=177.09257612704016, CurrSamplesPerSec=177.02410348870796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:38,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=9100, skipped=176, lr=[3.24418438361483e-06, 3.24418438361483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:38,607] [INFO] [timer.py:199:stop] epoch=9/micro_step=820/global_step=9100, RunningAvgSamplesPerSec=177.09245292527447, CurrSamplesPerSec=177.03356004793255, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:42,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=9110, skipped=176, lr=[3.2344586812041282e-06, 3.2344586812041282e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-21 22:43:42,226] [INFO] [timer.py:199:stop] epoch=9/micro_step=830/global_step=9110, RunningAvgSamplesPerSec=177.09238656280203, CurrSamplesPerSec=177.0270220720373, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:45,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=9120, skipped=176, lr=[3.2247402236353862e-06, 3.2247402236353862e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:45,848] [INFO] [timer.py:199:stop] epoch=9/micro_step=840/global_step=9120, RunningAvgSamplesPerSec=177.09219924780734, CurrSamplesPerSec=176.90289221749856, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:49,449] [INFO] [logging.py:96:log_dist] [Rank 0] step=9130, skipped=176, lr=[3.215029055175729e-06, 3.215029055175729e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:49,468] [INFO] [timer.py:199:stop] epoch=9/micro_step=850/global_step=9130, RunningAvgSamplesPerSec=177.0921079152184, CurrSamplesPerSec=176.889836042672, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:53,089] [INFO] [logging.py:96:log_dist] [Rank 0] step=9140, skipped=176, lr=[3.2053252200590755e-06, 3.2053252200590755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:53,108] [INFO] [timer.py:199:stop] epoch=9/micro_step=860/global_step=9140, RunningAvgSamplesPerSec=177.09092072848333, CurrSamplesPerSec=176.76694604974108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:43:56,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=9150, skipped=176, lr=[3.1956287624859495e-06, 3.1956287624859495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:43:56,730] [INFO] [timer.py:199:stop] epoch=9/micro_step=870/global_step=9150, RunningAvgSamplesPerSec=177.09074079269183, CurrSamplesPerSec=176.92131402746662, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:00,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=9160, skipped=176, lr=[3.185939726623261e-06, 3.185939726623261e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 
22:44:00,353] [INFO] [timer.py:199:stop] epoch=9/micro_step=880/global_step=9160, RunningAvgSamplesPerSec=177.09046277859724, CurrSamplesPerSec=176.42874672608193, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:03,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=9170, skipped=176, lr=[3.1762581566041202e-06, 3.1762581566041202e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:03,976] [INFO] [timer.py:199:stop] epoch=9/micro_step=890/global_step=9170, RunningAvgSamplesPerSec=177.090223588374, CurrSamplesPerSec=176.87083808836047, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:07,207] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:44:07,540] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:44:07,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=9180, skipped=178, lr=[3.1685183056319086e-06, 3.1685183056319086e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:07,541] [INFO] [timer.py:199:stop] epoch=9/micro_step=900/global_step=9180, RunningAvgSamplesPerSec=177.09306531806064, CurrSamplesPerSec=192.09691168430425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:11,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=9190, skipped=178, lr=[3.158850285237914e-06, 3.158850285237914e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:11,164] [INFO] [timer.py:199:stop] epoch=9/micro_step=910/global_step=9190, RunningAvgSamplesPerSec=177.09279730934077, CurrSamplesPerSec=176.78184682335214, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:14,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=9200, skipped=178, lr=[3.149189854078616e-06, 3.149189854078616e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:14,786] [INFO] [timer.py:199:stop] 
epoch=9/micro_step=920/global_step=9200, RunningAvgSamplesPerSec=177.0926427910578, CurrSamplesPerSec=176.74809085916934, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 10/16 *****
ppl: 1.7911655902862549
Beginning of Epoch 11/16, Total Micro Batches 920
[2023-04-21 22:44:26,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=9210, skipped=178, lr=[3.139537056156834e-06, 3.139537056156834e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:26,571] [INFO] [timer.py:199:stop] epoch=10/micro_step=10/global_step=9210, RunningAvgSamplesPerSec=177.0903788003762, CurrSamplesPerSec=176.79523637970652, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:30,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=9220, skipped=178, lr=[3.1298919354406117e-06, 3.1298919354406117e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:30,196] [INFO] [timer.py:199:stop] epoch=10/micro_step=20/global_step=9220, RunningAvgSamplesPerSec=177.09003537420352, CurrSamplesPerSec=176.81573215032623, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:33,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=9230, skipped=178, lr=[3.120254535863029e-06, 3.120254535863029e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:33,820] [INFO] [timer.py:199:stop] epoch=10/micro_step=30/global_step=9230, RunningAvgSamplesPerSec=177.08974296717108, CurrSamplesPerSec=176.66050191543397, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:37,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=9240, skipped=178, lr=[3.1106249013219936e-06, 3.1106249013219936e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:37,461] [INFO] [timer.py:199:stop] epoch=10/micro_step=40/global_step=9240, RunningAvgSamplesPerSec=177.08868395191737, CurrSamplesPerSec=170.81175518349022, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:41,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=9250, skipped=178, lr=[3.1010030756800415e-06,
3.1010030756800415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:41,086] [INFO] [timer.py:199:stop] epoch=10/micro_step=50/global_step=9250, RunningAvgSamplesPerSec=177.08832159935488, CurrSamplesPerSec=176.75577211789187, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:44,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=9260, skipped=178, lr=[3.0913891027641468e-06, 3.0913891027641468e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:44,708] [INFO] [timer.py:199:stop] epoch=10/micro_step=60/global_step=9260, RunningAvgSamplesPerSec=177.0881493524067, CurrSamplesPerSec=177.03472759444617, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:48,313] [INFO] [logging.py:96:log_dist] [Rank 0] step=9270, skipped=178, lr=[3.0817830263655086e-06, 3.0817830263655086e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:48,331] [INFO] [timer.py:199:stop] epoch=10/micro_step=70/global_step=9270, RunningAvgSamplesPerSec=177.08789597183807, CurrSamplesPerSec=176.88482391609756, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:51,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=9280, skipped=178, lr=[3.0721848902393567e-06, 3.0721848902393567e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:51,956] [INFO] [timer.py:199:stop] epoch=10/micro_step=80/global_step=9280, RunningAvgSamplesPerSec=177.08754157408558, CurrSamplesPerSec=176.9193317497937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:52,289] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:44:52,623] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:44:55,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=9290, skipped=180, lr=[3.0645121277150607e-06, 3.0645121277150607e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:55,522] [INFO] [timer.py:199:stop] epoch=10/micro_step=90/global_step=9290, RunningAvgSamplesPerSec=177.09028329417472, CurrSamplesPerSec=176.96446872513414, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:44:59,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=9300, skipped=180, lr=[3.054928394227003e-06, 3.054928394227003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:44:59,147] [INFO] [timer.py:199:stop] epoch=10/micro_step=100/global_step=9300, RunningAvgSamplesPerSec=177.08996025289161, CurrSamplesPerSec=177.1097163253928, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:02,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=9310, skipped=180, lr=[3.0453527233330375e-06, 3.0453527233330375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:02,771] [INFO] [timer.py:199:stop] epoch=10/micro_step=110/global_step=9310, RunningAvgSamplesPerSec=177.08966036124627, CurrSamplesPerSec=176.9061565623581, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:06,375] [INFO] [logging.py:96:log_dist] [Rank 0] step=9320, skipped=180, lr=[3.035785158649902e-06, 3.035785158649902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:06,393] [INFO] [timer.py:199:stop] epoch=10/micro_step=120/global_step=9320, RunningAvgSamplesPerSec=177.08946678318537, CurrSamplesPerSec=177.04009850676772, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:09,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=9330, skipped=180, lr=[3.0262257437574108e-06, 3.0262257437574108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:10,012] [INFO] [timer.py:199:stop] epoch=10/micro_step=130/global_step=9330, RunningAvgSamplesPerSec=177.0894286224066, CurrSamplesPerSec=177.08716630668977, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:45:13,612] [INFO] [logging.py:96:log_dist] [Rank 0] step=9340, skipped=180, lr=[3.016674522198254e-06, 3.016674522198254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:13,631] [INFO] [timer.py:199:stop] epoch=10/micro_step=140/global_step=9340, RunningAvgSamplesPerSec=177.08939749948172, CurrSamplesPerSec=176.84240705800508, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:17,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=9350, skipped=180, lr=[3.0071315374778044e-06, 3.0071315374778044e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:17,251] [INFO] [timer.py:199:stop] epoch=10/micro_step=150/global_step=9350, RunningAvgSamplesPerSec=177.08928573373464, CurrSamplesPerSec=177.11707845566917, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:20,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=9360, skipped=180, lr=[2.9975968330639143e-06, 2.9975968330639143e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:20,869] [INFO] [timer.py:199:stop] epoch=10/micro_step=160/global_step=9360, RunningAvgSamplesPerSec=177.08925629072144, CurrSamplesPerSec=176.90557363479755, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:24,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=9370, skipped=180, lr=[2.988070452386718e-06, 2.988070452386718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:24,490] [INFO] [timer.py:199:stop] epoch=10/micro_step=170/global_step=9370, RunningAvgSamplesPerSec=177.08912658230616, CurrSamplesPerSec=176.94813744496645, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:28,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=9380, skipped=180, lr=[2.978552438838442e-06, 2.978552438838442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:28,110] [INFO] [timer.py:199:stop] epoch=10/micro_step=180/global_step=9380, RunningAvgSamplesPerSec=177.08898958896336, CurrSamplesPerSec=177.12923315284021, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 22:45:29,167] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:45:29,501] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:45:31,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=9390, skipped=182, lr=[2.9709440814678908e-06, 2.9709440814678908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:31,673] [INFO] [timer.py:199:stop] epoch=10/micro_step=190/global_step=9390, RunningAvgSamplesPerSec=177.09186090594383, CurrSamplesPerSec=177.12338934021673, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:35,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=9400, skipped=182, lr=[2.9614412379782863e-06, 2.9614412379782863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:35,320] [INFO] [timer.py:199:stop] epoch=10/micro_step=200/global_step=9400, RunningAvgSamplesPerSec=177.09036510601928, CurrSamplesPerSec=177.07793764434334, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:38,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=9410, skipped=182, lr=[2.9519468829124396e-06, 2.9519468829124396e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:38,940] [INFO] [timer.py:199:stop] epoch=10/micro_step=210/global_step=9410, RunningAvgSamplesPerSec=177.09025528343076, CurrSamplesPerSec=177.0796898480245, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:42,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=9420, skipped=182, lr=[2.9424610595166944e-06, 2.9424610595166944e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:42,576] [INFO] [timer.py:199:stop] epoch=10/micro_step=220/global_step=9420, RunningAvgSamplesPerSec=177.08930945247886, CurrSamplesPerSec=176.72272417252813, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:46,190] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=9430, skipped=182, lr=[2.932983810998537e-06, 2.932983810998537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:46,208] [INFO] [timer.py:199:stop] epoch=10/micro_step=230/global_step=9430, RunningAvgSamplesPerSec=177.08884481501696, CurrSamplesPerSec=176.8531258811841, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:49,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=9440, skipped=182, lr=[2.9235151805263955e-06, 2.9235151805263955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:49,827] [INFO] [timer.py:199:stop] epoch=10/micro_step=240/global_step=9440, RunningAvgSamplesPerSec=177.088792124428, CurrSamplesPerSec=177.0131304765271, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:53,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=9450, skipped=182, lr=[2.914055211229443e-06, 2.914055211229443e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:53,448] [INFO] [timer.py:199:stop] epoch=10/micro_step=250/global_step=9450, RunningAvgSamplesPerSec=177.08868735179618, CurrSamplesPerSec=177.12970067450843, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:45:57,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=9460, skipped=182, lr=[2.904603946197398e-06, 2.904603946197398e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:45:57,067] [INFO] [timer.py:199:stop] epoch=10/micro_step=260/global_step=9460, RunningAvgSamplesPerSec=177.08864992235408, CurrSamplesPerSec=177.14957262701083, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:00,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=9470, skipped=182, lr=[2.8951614284803398e-06, 2.8951614284803398e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:00,686] [INFO] [timer.py:199:stop] epoch=10/micro_step=270/global_step=9470, RunningAvgSamplesPerSec=177.08860579783254, CurrSamplesPerSec=176.96493537743467, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:04,287] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=9480, skipped=182, lr=[2.885727701088495e-06, 2.885727701088495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:04,305] [INFO] [timer.py:199:stop] epoch=10/micro_step=280/global_step=9480, RunningAvgSamplesPerSec=177.08854438669294, CurrSamplesPerSec=176.98943808001056, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:06,109] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:46:06,443] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:46:07,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=9490, skipped=184, lr=[2.8781870770864895e-06, 2.8781870770864895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:07,892] [INFO] [timer.py:199:stop] epoch=10/micro_step=290/global_step=9490, RunningAvgSamplesPerSec=177.09014953365624, CurrSamplesPerSec=176.83728112359847, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:11,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=9500, skipped=184, lr=[2.8687692805378802e-06, 2.8687692805378802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:11,514] [INFO] [timer.py:199:stop] epoch=10/micro_step=300/global_step=9500, RunningAvgSamplesPerSec=177.08992485174338, CurrSamplesPerSec=176.90569022000233, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:15,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=9510, skipped=184, lr=[2.859360394529495e-06, 2.859360394529495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:15,134] [INFO] [timer.py:199:stop] epoch=10/micro_step=310/global_step=9510, RunningAvgSamplesPerSec=177.08984895122424, CurrSamplesPerSec=176.89519816696463, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:18,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=9520, skipped=184, lr=[2.8499604619183716e-06, 
2.8499604619183716e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:18,754] [INFO] [timer.py:199:stop] epoch=10/micro_step=320/global_step=9520, RunningAvgSamplesPerSec=177.08973071618686, CurrSamplesPerSec=176.73447572295856, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:22,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=9530, skipped=184, lr=[2.8405695255207722e-06, 2.8405695255207722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:22,374] [INFO] [timer.py:199:stop] epoch=10/micro_step=330/global_step=9530, RunningAvgSamplesPerSec=177.08964733152646, CurrSamplesPerSec=176.57973653362663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:25,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=9540, skipped=184, lr=[2.831187628111973e-06, 2.831187628111973e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:25,992] [INFO] [timer.py:199:stop] epoch=10/micro_step=340/global_step=9540, RunningAvgSamplesPerSec=177.08962110259748, CurrSamplesPerSec=176.9810363890616, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:29,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=9550, skipped=184, lr=[2.8218148124260823e-06, 2.8218148124260823e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:29,613] [INFO] [timer.py:199:stop] epoch=10/micro_step=350/global_step=9550, RunningAvgSamplesPerSec=177.0894803278113, CurrSamplesPerSec=176.90557363479755, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:33,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=9560, skipped=184, lr=[2.8124511211558416e-06, 2.8124511211558416e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:33,233] [INFO] [timer.py:199:stop] epoch=10/micro_step=360/global_step=9560, RunningAvgSamplesPerSec=177.08936291262825, CurrSamplesPerSec=176.70504161630842, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:36,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=9570, skipped=184, lr=[2.8030965969524295e-06, 
2.8030965969524295e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:36,865] [INFO] [timer.py:199:stop] epoch=10/micro_step=370/global_step=9570, RunningAvgSamplesPerSec=177.08864584496484, CurrSamplesPerSec=176.0110156277662, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:40,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=9580, skipped=184, lr=[2.79375128242527e-06, 2.79375128242527e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:40,492] [INFO] [timer.py:199:stop] epoch=10/micro_step=380/global_step=9580, RunningAvgSamplesPerSec=177.0882191248174, CurrSamplesPerSec=176.89006917175115, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:42,998] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:46:43,332] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:46:44,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=9590, skipped=186, lr=[2.7862816903772034e-06, 2.7862816903772034e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:44,058] [INFO] [timer.py:199:stop] epoch=10/micro_step=390/global_step=9590, RunningAvgSamplesPerSec=177.09091677497463, CurrSamplesPerSec=176.47282158153723, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:47,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=9600, skipped=186, lr=[2.7769530605090217e-06, 2.7769530605090217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:47,681] [INFO] [timer.py:199:stop] epoch=10/micro_step=400/global_step=9600, RunningAvgSamplesPerSec=177.0906569820425, CurrSamplesPerSec=177.03472759444617, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:51,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=9610, skipped=186, lr=[2.7676337593996896e-06, 2.7676337593996896e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 
22:46:51,304] [INFO] [timer.py:199:stop] epoch=10/micro_step=410/global_step=9610, RunningAvgSamplesPerSec=177.09040270390693, CurrSamplesPerSec=177.06695795742908, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:54,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=9620, skipped=186, lr=[2.758323829498193e-06, 2.758323829498193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:54,924] [INFO] [timer.py:199:stop] epoch=10/micro_step=420/global_step=9620, RunningAvgSamplesPerSec=177.09030167790192, CurrSamplesPerSec=177.0409158480815, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:46:58,525] [INFO] [logging.py:96:log_dist] [Rank 0] step=9630, skipped=186, lr=[2.749023313210828e-06, 2.749023313210828e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:46:58,543] [INFO] [timer.py:199:stop] epoch=10/micro_step=430/global_step=9630, RunningAvgSamplesPerSec=177.09023244734078, CurrSamplesPerSec=176.80583303145067, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:02,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=9640, skipped=186, lr=[2.739732252901016e-06, 2.739732252901016e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:02,162] [INFO] [timer.py:199:stop] epoch=10/micro_step=440/global_step=9640, RunningAvgSamplesPerSec=177.09015333588349, CurrSamplesPerSec=176.986754119805, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:05,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=9650, skipped=186, lr=[2.7304506908891064e-06, 2.7304506908891064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:05,782] [INFO] [timer.py:199:stop] epoch=10/micro_step=450/global_step=9650, RunningAvgSamplesPerSec=177.0901125805493, CurrSamplesPerSec=177.04710453435308, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:09,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=9660, skipped=186, lr=[2.721178669452184e-06, 2.721178669452184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:09,401] [INFO] 
[timer.py:199:stop] epoch=10/micro_step=460/global_step=9660, RunningAvgSamplesPerSec=177.09003131363474, CurrSamplesPerSec=176.94323864868554, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:13,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=9670, skipped=186, lr=[2.711916230823877e-06, 2.711916230823877e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:13,033] [INFO] [timer.py:199:stop] epoch=10/micro_step=470/global_step=9670, RunningAvgSamplesPerSec=177.08962284742265, CurrSamplesPerSec=176.9787027275778, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:16,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=9680, skipped=186, lr=[2.7026634171941642e-06, 2.7026634171941642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:16,651] [INFO] [timer.py:199:stop] epoch=10/micro_step=480/global_step=9680, RunningAvgSamplesPerSec=177.08959020283461, CurrSamplesPerSec=177.09090477277437, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:19,878] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:47:20,212] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:47:20,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=9690, skipped=188, lr=[2.6952681246130607e-06, 2.6952681246130607e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:20,212] [INFO] [timer.py:199:stop] epoch=10/micro_step=490/global_step=9690, RunningAvgSamplesPerSec=177.0924663082579, CurrSamplesPerSec=192.25251367039925, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:23,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=9700, skipped=188, lr=[2.686032742159498e-06, 2.686032742159498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:23,832] [INFO] [timer.py:199:stop] epoch=10/micro_step=500/global_step=9700, RunningAvgSamplesPerSec=177.09240351150277, CurrSamplesPerSec=176.91443454832805, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:27,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=9710, skipped=188, lr=[2.676807102602617e-06, 2.676807102602617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:27,452] [INFO] [timer.py:199:stop] epoch=10/micro_step=510/global_step=9710, RunningAvgSamplesPerSec=177.09227657968134, CurrSamplesPerSec=177.20231335977815, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:31,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=9720, skipped=188, lr=[2.6675912479647796e-06, 2.6675912479647796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:31,074] [INFO] [timer.py:199:stop] epoch=10/micro_step=520/global_step=9720, RunningAvgSamplesPerSec=177.0920825372144, CurrSamplesPerSec=177.1447795716487, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:34,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=9730, skipped=188, lr=[2.6583852202237785e-06, 2.6583852202237785e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:34,696] [INFO] [timer.py:199:stop] epoch=10/micro_step=530/global_step=9730, RunningAvgSamplesPerSec=177.09184826513874, CurrSamplesPerSec=176.80024343047282, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:38,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=9740, skipped=188, lr=[2.6491890613126433e-06, 2.6491890613126433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:38,316] [INFO] [timer.py:199:stop] epoch=10/micro_step=540/global_step=9740, RunningAvgSamplesPerSec=177.09176049504046, CurrSamplesPerSec=177.05317687347184, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:41,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=9750, skipped=188, lr=[2.6400028131194465e-06, 2.6400028131194465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:41,954] [INFO] [timer.py:199:stop] epoch=10/micro_step=550/global_step=9750, RunningAvgSamplesPerSec=177.09076475242793, CurrSamplesPerSec=168.87289463635156, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:45,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=9760, skipped=188, lr=[2.6308265174871297e-06, 2.6308265174871297e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:45,575] [INFO] [timer.py:199:stop] epoch=10/micro_step=560/global_step=9760, RunningAvgSamplesPerSec=177.0906396977636, CurrSamplesPerSec=176.91630011691805, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:49,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=9770, skipped=188, lr=[2.6216602162132887e-06, 2.6216602162132887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:49,197] [INFO] [timer.py:199:stop] epoch=10/micro_step=570/global_step=9770, RunningAvgSamplesPerSec=177.09041328397294, CurrSamplesPerSec=176.8554562320796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:52,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=9780, skipped=188, lr=[2.612503951050003e-06, 2.612503951050003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:52,835] [INFO] [timer.py:199:stop] epoch=10/micro_step=580/global_step=9780, RunningAvgSamplesPerSec=177.08953974886657, CurrSamplesPerSec=176.77043819317586, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:56,438] [INFO] [logging.py:96:log_dist] [Rank 0] step=9790, skipped=188, lr=[2.603357763703635e-06, 2.603357763703635e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:47:56,456] [INFO] [timer.py:199:stop] epoch=10/micro_step=590/global_step=9790, RunningAvgSamplesPerSec=177.08937927010456, CurrSamplesPerSec=176.98488706506302, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:47:56,789] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:47:57,122] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:48:00,003] [INFO] [logging.py:96:log_dist] [Rank 0] step=9800, skipped=190, lr=[2.596048097852099e-06, 2.596048097852099e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:00,021] [INFO] [timer.py:199:stop] epoch=10/micro_step=600/global_step=9800, RunningAvgSamplesPerSec=177.09202303090214, CurrSamplesPerSec=176.73680294803347, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:03,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=9810, skipped=190, lr=[2.586920155529573e-06, 2.586920155529573e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:03,641] [INFO] [timer.py:199:stop] epoch=10/micro_step=610/global_step=9810, RunningAvgSamplesPerSec=177.0919022843702, CurrSamplesPerSec=177.05411111646103, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:07,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=9820, skipped=190, lr=[2.57780240755697e-06, 2.57780240755697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:07,264] [INFO] [timer.py:199:stop] epoch=10/micro_step=620/global_step=9820, RunningAvgSamplesPerSec=177.09168133914807, CurrSamplesPerSec=176.7213280525828, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:10,868] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=9830, skipped=190, lr=[2.568694895465204e-06, 2.568694895465204e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:10,886] [INFO] [timer.py:199:stop] epoch=10/micro_step=630/global_step=9830, RunningAvgSamplesPerSec=177.09147274347913, CurrSamplesPerSec=176.74809085916934, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:14,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=9840, skipped=190, lr=[2.559597660738574e-06, 2.559597660738574e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:14,507] [INFO] [timer.py:199:stop] epoch=10/micro_step=640/global_step=9840, RunningAvgSamplesPerSec=177.09135414841154, CurrSamplesPerSec=177.0270220720373, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:18,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=9850, skipped=190, lr=[2.5505107448145615e-06, 2.5505107448145615e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:18,131] [INFO] [timer.py:199:stop] epoch=10/micro_step=650/global_step=9850, RunningAvgSamplesPerSec=177.09105358582008, CurrSamplesPerSec=176.9812697585949, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:21,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=9860, skipped=190, lr=[2.541434189083649e-06, 2.541434189083649e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:21,752] [INFO] [timer.py:199:stop] epoch=10/micro_step=660/global_step=9860, RunningAvgSamplesPerSec=177.09092295257955, CurrSamplesPerSec=177.000641575546, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:25,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=9870, skipped=190, lr=[2.532368034889122e-06, 2.532368034889122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:25,373] [INFO] [timer.py:199:stop] epoch=10/micro_step=670/global_step=9870, RunningAvgSamplesPerSec=177.0907754529486, CurrSamplesPerSec=176.83844608259716, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:28,977] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=9880, skipped=190, lr=[2.5233123235268985e-06, 2.5233123235268985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:28,995] [INFO] [timer.py:199:stop] epoch=10/micro_step=680/global_step=9880, RunningAvgSamplesPerSec=177.0905904051459, CurrSamplesPerSec=176.93973967455034, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:32,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=9890, skipped=190, lr=[2.51426709624532e-06, 2.51426709624532e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:32,617] [INFO] [timer.py:199:stop] epoch=10/micro_step=690/global_step=9890, RunningAvgSamplesPerSec=177.09040437326632, CurrSamplesPerSec=176.7897638811687, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:33,674] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:48:34,007] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:48:36,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=9900, skipped=192, lr=[2.50703849064653e-06, 2.50703849064653e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:36,182] [INFO] [timer.py:199:stop] epoch=10/micro_step=700/global_step=9900, RunningAvgSamplesPerSec=177.09301928506525, CurrSamplesPerSec=176.89228392865144, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:39,785] [INFO] [logging.py:96:log_dist] [Rank 0] step=9910, skipped=192, lr=[2.4980122385033927e-06, 2.4980122385033927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:39,803] [INFO] [timer.py:199:stop] epoch=10/micro_step=710/global_step=9910, RunningAvgSamplesPerSec=177.0928637525974, CurrSamplesPerSec=176.94673776120752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:43,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=9920, skipped=192, lr=[2.4889965856816176e-06, 2.4889965856816176e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:43,424] [INFO] [timer.py:199:stop] epoch=10/micro_step=720/global_step=9920, RunningAvgSamplesPerSec=177.09275607726525, CurrSamplesPerSec=176.92341295812597, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:44,120] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 32768, reducing to 16384 [2023-04-21 22:48:47,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=9930, skipped=193, lr=[2.4808915945037305e-06, 2.4808915945037305e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:47,018] [INFO] [timer.py:199:stop] epoch=10/micro_step=730/global_step=9930, RunningAvgSamplesPerSec=177.09391936255784, CurrSamplesPerSec=176.72412031453266, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:50,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=9940, skipped=193, lr=[2.4718961934889875e-06, 2.4718961934889875e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:50,650] [INFO] [timer.py:199:stop] epoch=10/micro_step=740/global_step=9940, RunningAvgSamplesPerSec=177.09330500678195, CurrSamplesPerSec=176.90312538099334, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:54,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=9950, skipped=193, lr=[2.462911510752977e-06, 2.462911510752977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:54,272] [INFO] [timer.py:199:stop] epoch=10/micro_step=750/global_step=9950, RunningAvgSamplesPerSec=177.09311268819314, CurrSamplesPerSec=176.9783526836633, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:48:57,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=9960, skipped=193, lr=[2.4539375872205144e-06, 2.4539375872205144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:48:57,907] [INFO] [timer.py:199:stop] epoch=10/micro_step=760/global_step=9960, RunningAvgSamplesPerSec=177.09227900282772, CurrSamplesPerSec=176.92691128656463, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:01,512] [INFO] [logging.py:96:log_dist] [Rank 0] step=9970, skipped=193, lr=[2.4449744637674073e-06, 2.4449744637674073e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:01,530] [INFO] [timer.py:199:stop] epoch=10/micro_step=770/global_step=9970, RunningAvgSamplesPerSec=177.09202129414226, CurrSamplesPerSec=177.31749312523078, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:05,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=9980, skipped=193, lr=[2.4360221812202637e-06, 2.4360221812202637e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:05,152] [INFO] [timer.py:199:stop] epoch=10/micro_step=780/global_step=9980, RunningAvgSamplesPerSec=177.09184276268599, CurrSamplesPerSec=176.85464060228048, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:08,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=9990, skipped=193, lr=[2.4270807803563164e-06, 2.4270807803563164e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:08,774] [INFO] [timer.py:199:stop] epoch=10/micro_step=790/global_step=9990, RunningAvgSamplesPerSec=177.09161616314154, CurrSamplesPerSec=176.78242893580367, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:12,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=10000, skipped=193, lr=[2.4181503019032336e-06, 2.4181503019032336e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:12,403] [INFO] [timer.py:199:stop] epoch=10/micro_step=800/global_step=10000, RunningAvgSamplesPerSec=177.0912028182879, CurrSamplesPerSec=176.96073559532763, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:16,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=10010, skipped=193, lr=[2.4092307865389305e-06, 2.4092307865389305e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:16,023] [INFO] [timer.py:199:stop] epoch=10/micro_step=810/global_step=10010, RunningAvgSamplesPerSec=177.09106281030517, CurrSamplesPerSec=176.85382498000445, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:19,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=10020, skipped=193, lr=[2.4003222748913817e-06, 2.4003222748913817e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:19,645] [INFO] [timer.py:199:stop] epoch=10/micro_step=820/global_step=10020, RunningAvgSamplesPerSec=177.0909040963806, CurrSamplesPerSec=176.64841146009846, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:23,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=10030, skipped=193, lr=[2.391424807538452e-06, 2.391424807538452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:23,274] [INFO] [timer.py:199:stop] epoch=10/micro_step=830/global_step=10030, RunningAvgSamplesPerSec=177.09035084781144, CurrSamplesPerSec=176.7727663654551, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:26,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=10040, skipped=193, lr=[2.3825384250076864e-06, 2.3825384250076864e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:26,896] [INFO] [timer.py:199:stop] epoch=10/micro_step=840/global_step=10040, RunningAvgSamplesPerSec=177.09017963655964, CurrSamplesPerSec=176.84730027261455, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:30,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=10050, skipped=193, lr=[2.373663167776148e-06, 2.373663167776148e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:30,529] [INFO] [timer.py:199:stop] epoch=10/micro_step=850/global_step=10050, RunningAvgSamplesPerSec=177.0894389626339, CurrSamplesPerSec=176.85627186940192, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:34,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=10060, skipped=193, lr=[2.3647990762702207e-06, 2.3647990762702207e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:34,150] [INFO] [timer.py:199:stop] epoch=10/micro_step=860/global_step=10060, RunningAvgSamplesPerSec=177.08931732015455, CurrSamplesPerSec=177.0466374485963, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:37,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=10070, skipped=193, lr=[2.355946190865432e-06, 2.355946190865432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:37,770] [INFO] [timer.py:199:stop] epoch=10/micro_step=870/global_step=10070, RunningAvgSamplesPerSec=177.08921093483013, CurrSamplesPerSec=177.00425967394372, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:41,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=10080, skipped=193, lr=[2.3471045518862654e-06, 2.3471045518862654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:41,389] [INFO] [timer.py:199:stop] epoch=10/micro_step=880/global_step=10080, RunningAvgSamplesPerSec=177.08914263279536, CurrSamplesPerSec=177.0720972158365, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:44,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=10090, skipped=193, lr=[2.338274199605973e-06, 2.338274199605973e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:45,009] [INFO] [timer.py:199:stop] epoch=10/micro_step=890/global_step=10090, RunningAvgSamplesPerSec=177.08905410718043, CurrSamplesPerSec=176.96190218147274, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:48,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=10100, skipped=193, lr=[2.3294551742464016e-06, 2.3294551742464016e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:48,632] [INFO] [timer.py:199:stop] epoch=10/micro_step=900/global_step=10100, RunningAvgSamplesPerSec=177.088831564569, CurrSamplesPerSec=176.22061692702383, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:52,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=10110, skipped=193, lr=[2.320647515977803e-06, 2.320647515977803e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:52,255] [INFO] [timer.py:199:stop] epoch=10/micro_step=910/global_step=10110, RunningAvgSamplesPerSec=177.08860935882447, CurrSamplesPerSec=176.88214312777782, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:49:55,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=10120, skipped=193, lr=[2.311851264918654e-06, 2.311851264918654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:49:55,884] [INFO] [timer.py:199:stop] epoch=10/micro_step=920/global_step=10120, RunningAvgSamplesPerSec=177.08828948338058, CurrSamplesPerSec=177.23004114574255, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 11/16 *****
ppl: 1.7767704725265503
Beginning of Epoch 12/16, Total Micro Batches 920
[2023-04-21 22:50:05,105] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-21 22:50:05,441] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-21 22:50:07,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=10130, skipped=195, lr=[2.3048225041885837e-06, 2.3048225041885837e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:50:07,615] [INFO] [timer.py:199:stop] epoch=11/micro_step=10/global_step=10130, RunningAvgSamplesPerSec=177.08975931323005, CurrSamplesPerSec=177.03706273367422, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:50:11,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=10140, skipped=195, lr=[2.296046887039025e-06, 2.296046887039025e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:50:11,239] [INFO] [timer.py:199:stop] epoch=11/micro_step=20/global_step=10140, RunningAvgSamplesPerSec=177.089491109852, CurrSamplesPerSec=177.1250255688829, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:50:14,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=10150, skipped=195, lr=[2.2872827891536406e-06, 2.2872827891536406e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:50:14,858] [INFO] [timer.py:199:stop] epoch=11/micro_step=30/global_step=10150, RunningAvgSamplesPerSec=177.0894241423797, CurrSamplesPerSec=177.06964435074684, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
[2023-04-21 22:50:18,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=10160, skipped=195, lr=[2.2785302504524855e-06, 2.2785302504524855e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-21 22:50:18,480] [INFO] [timer.py:199:stop] epoch=11/micro_step=40/global_step=10160,
RunningAvgSamplesPerSec=177.08926804734017, CurrSamplesPerSec=177.1631348717123, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:22,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=10170, skipped=195, lr=[2.2697893108029705e-06, 2.2697893108029705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:22,102] [INFO] [timer.py:199:stop] epoch=11/micro_step=50/global_step=10170, RunningAvgSamplesPerSec=177.0890670591786, CurrSamplesPerSec=176.89205079373448, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:25,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=10180, skipped=195, lr=[2.261060010019671e-06, 2.261060010019671e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:25,739] [INFO] [timer.py:199:stop] epoch=11/micro_step=60/global_step=10180, RunningAvgSamplesPerSec=177.08815056342314, CurrSamplesPerSec=169.71980078892844, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:29,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=10190, skipped=195, lr=[2.2523423878641423e-06, 2.2523423878641423e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:29,360] [INFO] [timer.py:199:stop] epoch=11/micro_step=70/global_step=10190, RunningAvgSamplesPerSec=177.08801367896032, CurrSamplesPerSec=177.1536644648574, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:32,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=10200, skipped=195, lr=[2.243636484044757e-06, 2.243636484044757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:32,984] [INFO] [timer.py:199:stop] epoch=11/micro_step=80/global_step=10200, RunningAvgSamplesPerSec=177.08773643272968, CurrSamplesPerSec=176.86466171853638, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:36,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=10210, skipped=195, lr=[2.2349423382164974e-06, 2.2349423382164974e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:36,607] [INFO] [timer.py:199:stop] epoch=11/micro_step=90/global_step=10210, 
RunningAvgSamplesPerSec=177.08749296525917, CurrSamplesPerSec=176.84753328958845, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:40,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=10220, skipped=195, lr=[2.226259989980796e-06, 2.226259989980796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:40,227] [INFO] [timer.py:199:stop] epoch=11/micro_step=100/global_step=10220, RunningAvgSamplesPerSec=177.08737356826805, CurrSamplesPerSec=176.97660248486108, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:42,007] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:50:42,341] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:50:43,773] [INFO] [logging.py:96:log_dist] [Rank 0] step=10230, skipped=197, lr=[2.2193226322367385e-06, 2.2193226322367385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:43,791] [INFO] [timer.py:199:stop] epoch=11/micro_step=110/global_step=10230, RunningAvgSamplesPerSec=177.08998742354692, CurrSamplesPerSec=176.92504549418115, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:47,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=10240, skipped=197, lr=[2.2106616192916047e-06, 2.2106616192916047e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:47,415] [INFO] [timer.py:199:stop] epoch=11/micro_step=120/global_step=10240, RunningAvgSamplesPerSec=177.08967504722216, CurrSamplesPerSec=176.66922202097638, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:51,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=10250, skipped=197, lr=[2.202012514536578e-06, 2.202012514536578e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:51,037] [INFO] [timer.py:199:stop] epoch=11/micro_step=130/global_step=10250, RunningAvgSamplesPerSec=177.08951597375054, 
CurrSamplesPerSec=176.9231797411489, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:54,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=10260, skipped=197, lr=[2.1933753573679307e-06, 2.1933753573679307e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:54,655] [INFO] [timer.py:199:stop] epoch=11/micro_step=140/global_step=10260, RunningAvgSamplesPerSec=177.0894874088805, CurrSamplesPerSec=177.21003415001582, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:50:58,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=10270, skipped=197, lr=[2.184750187127514e-06, 2.184750187127514e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:50:58,277] [INFO] [timer.py:199:stop] epoch=11/micro_step=150/global_step=10270, RunningAvgSamplesPerSec=177.08931449227697, CurrSamplesPerSec=177.10550966859122, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:01,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=10280, skipped=197, lr=[2.176137043102575e-06, 2.176137043102575e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:01,913] [INFO] [timer.py:199:stop] epoch=11/micro_step=160/global_step=10280, RunningAvgSamplesPerSec=177.08847064168788, CurrSamplesPerSec=176.91256901908216, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:05,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=10290, skipped=197, lr=[2.1675359645255873e-06, 2.1675359645255873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:05,535] [INFO] [timer.py:199:stop] epoch=11/micro_step=170/global_step=10290, RunningAvgSamplesPerSec=177.08830274207423, CurrSamplesPerSec=177.02468719767498, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:09,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=10300, skipped=197, lr=[2.158946990574067e-06, 2.158946990574067e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:09,171] [INFO] [timer.py:199:stop] epoch=11/micro_step=180/global_step=10300, RunningAvgSamplesPerSec=177.08753104114444, 
CurrSamplesPerSec=177.00612713744255, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:12,771] [INFO] [logging.py:96:log_dist] [Rank 0] step=10310, skipped=197, lr=[2.150370160370387e-06, 2.150370160370387e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:12,789] [INFO] [timer.py:199:stop] epoch=11/micro_step=190/global_step=10310, RunningAvgSamplesPerSec=177.087504393735, CurrSamplesPerSec=177.1030558777385, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:16,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=10320, skipped=197, lr=[2.141805512981618e-06, 2.141805512981618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:16,413] [INFO] [timer.py:199:stop] epoch=11/micro_step=200/global_step=10320, RunningAvgSamplesPerSec=177.08725869758183, CurrSamplesPerSec=177.16629189901272, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:18,917] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:51:19,251] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:51:19,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=10330, skipped=199, lr=[2.1349625929149804e-06, 2.1349625929149804e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:19,975] [INFO] [timer.py:199:stop] epoch=11/micro_step=210/global_step=10330, RunningAvgSamplesPerSec=177.08991225897987, CurrSamplesPerSec=177.07221402063107, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:23,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=10340, skipped=199, lr=[2.126419972864798e-06, 2.126419972864798e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:23,595] [INFO] [timer.py:199:stop] epoch=11/micro_step=220/global_step=10340, RunningAvgSamplesPerSec=177.08985039761595, CurrSamplesPerSec=177.0443020567799, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:27,195] [INFO] [logging.py:96:log_dist] [Rank 0] step=10350, skipped=199, lr=[2.11788964472152e-06, 2.11788964472152e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:27,214] [INFO] [timer.py:199:stop] epoch=11/micro_step=230/global_step=10350, RunningAvgSamplesPerSec=177.08979257345834, CurrSamplesPerSec=177.05656355121693, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:30,816] [INFO] [logging.py:96:log_dist] [Rank 0] step=10360, skipped=199, lr=[2.109371647340391e-06, 2.109371647340391e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:30,834] [INFO] [timer.py:199:stop] epoch=11/micro_step=240/global_step=10360, RunningAvgSamplesPerSec=177.08969132402083, CurrSamplesPerSec=176.9942227327094, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:34,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=10370, skipped=199, lr=[2.100866019520495e-06, 2.100866019520495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:34,454] [INFO] [timer.py:199:stop] epoch=11/micro_step=250/global_step=10370, RunningAvgSamplesPerSec=177.08962309402986, CurrSamplesPerSec=176.99270561969925, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:38,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=10380, skipped=199, lr=[2.0923728000045745e-06, 2.0923728000045745e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:38,083] [INFO] [timer.py:199:stop] epoch=11/micro_step=260/global_step=10380, RunningAvgSamplesPerSec=177.0892946495206, CurrSamplesPerSec=177.0480387132603, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:41,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=10390, skipped=199, lr=[2.0838920274788473e-06, 2.0838920274788473e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:41,710] [INFO] [timer.py:199:stop] epoch=11/micro_step=270/global_step=10390, RunningAvgSamplesPerSec=177.08885875007707, CurrSamplesPerSec=177.10258849671374, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:45,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=10400, skipped=199, lr=[2.07542374057284e-06, 2.07542374057284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:45,330] [INFO] [timer.py:199:stop] epoch=11/micro_step=280/global_step=10400, RunningAvgSamplesPerSec=177.08879377661816, CurrSamplesPerSec=177.09078794331995, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:48,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=10410, skipped=199, lr=[2.066967977859208e-06, 2.066967977859208e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:48,951] [INFO] [timer.py:199:stop] epoch=11/micro_step=290/global_step=10410, RunningAvgSamplesPerSec=177.08866107478403, CurrSamplesPerSec=177.1152086503149, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:52,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=10420, skipped=199, lr=[2.058524777853557e-06, 2.058524777853557e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:52,571] [INFO] [timer.py:199:stop] epoch=11/micro_step=300/global_step=10420, RunningAvgSamplesPerSec=177.08858451799176, CurrSamplesPerSec=177.20395104694293, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:55,800] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:51:56,134] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:51:56,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=10430, skipped=201, lr=[2.051779288844746e-06, 2.051779288844746e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:56,134] [INFO] [timer.py:199:stop] epoch=11/micro_step=310/global_step=10430, RunningAvgSamplesPerSec=177.09116951059136, CurrSamplesPerSec=192.23475322867327, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:51:59,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=10440, skipped=201, lr=[2.043358798590027e-06, 2.043358798590027e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:51:59,755] [INFO] [timer.py:199:stop] epoch=11/micro_step=320/global_step=10440, RunningAvgSamplesPerSec=177.09107691685756, CurrSamplesPerSec=177.0812084525978, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:03,357] [INFO] [logging.py:96:log_dist] [Rank 0] step=10450, skipped=201, lr=[2.0349509785820076e-06, 2.0349509785820076e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:03,375] [INFO] [timer.py:199:stop] epoch=11/micro_step=330/global_step=10450, RunningAvgSamplesPerSec=177.09097385519462, CurrSamplesPerSec=176.55975995253775, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:06,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=10460, skipped=201, lr=[2.026555867117914e-06, 2.026555867117914e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:06,993] [INFO] [timer.py:199:stop] epoch=11/micro_step=340/global_step=10460, RunningAvgSamplesPerSec=177.09100160571873, CurrSamplesPerSec=177.03285952741635, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:52:10,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=10470, skipped=201, lr=[2.0181735024370968e-06, 2.0181735024370968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:10,634] [INFO] [timer.py:199:stop] epoch=11/micro_step=350/global_step=10470, RunningAvgSamplesPerSec=177.0900390353573, CurrSamplesPerSec=176.95758588956247, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:14,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=10480, skipped=201, lr=[2.0098039227208325e-06, 2.0098039227208325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:14,277] [INFO] [timer.py:199:stop] epoch=11/micro_step=360/global_step=10480, RunningAvgSamplesPerSec=177.08902773747755, CurrSamplesPerSec=177.0325092692372, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:17,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=10490, skipped=201, lr=[2.0014471660921705e-06, 2.0014471660921705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:17,898] [INFO] [timer.py:199:stop] epoch=11/micro_step=370/global_step=10490, RunningAvgSamplesPerSec=177.08891931680338, CurrSamplesPerSec=177.02200216829905, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:21,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=10500, skipped=201, lr=[1.9931032706157525e-06, 1.9931032706157525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:21,696] [INFO] [timer.py:199:stop] epoch=11/micro_step=380/global_step=10500, RunningAvgSamplesPerSec=177.08058813986432, CurrSamplesPerSec=176.6465515369313, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:25,299] [INFO] [logging.py:96:log_dist] [Rank 0] step=10510, skipped=201, lr=[1.984772274297629e-06, 1.984772274297629e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:25,318] [INFO] [timer.py:199:stop] epoch=11/micro_step=390/global_step=10510, RunningAvgSamplesPerSec=177.08040117791123, CurrSamplesPerSec=176.89309990570024, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:52:28,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=10520, skipped=201, lr=[1.976454215085109e-06, 1.976454215085109e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:28,950] [INFO] [timer.py:199:stop] epoch=11/micro_step=400/global_step=10520, RunningAvgSamplesPerSec=177.07978545001902, CurrSamplesPerSec=176.79488706116663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:32,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=10530, skipped=201, lr=[1.9681491308665617e-06, 1.9681491308665617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:32,571] [INFO] [timer.py:199:stop] epoch=11/micro_step=410/global_step=10530, RunningAvgSamplesPerSec=177.0796531548019, CurrSamplesPerSec=177.02025110606556, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:32,904] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:52:33,237] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:52:36,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=10540, skipped=203, lr=[1.961514430910647e-06, 1.961514430910647e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:36,136] [INFO] [timer.py:199:stop] epoch=11/micro_step=420/global_step=10540, RunningAvgSamplesPerSec=177.08214382284055, CurrSamplesPerSec=177.1266618277796, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:39,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=10550, skipped=203, lr=[1.9532327969711997e-06, 1.9532327969711997e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:39,759] [INFO] [timer.py:199:stop] epoch=11/micro_step=430/global_step=10550, RunningAvgSamplesPerSec=177.08194921424484, CurrSamplesPerSec=176.48987749906638, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:43,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=10560, skipped=203, lr=[1.944964243798208e-06, 1.944964243798208e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:43,476] [INFO] [timer.py:199:stop] epoch=11/micro_step=440/global_step=10560, RunningAvgSamplesPerSec=177.0776720674165, CurrSamplesPerSec=176.58020115906564, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:47,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=10570, skipped=203, lr=[1.93670880905455e-06, 1.93670880905455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:47,099] [INFO] [timer.py:199:stop] epoch=11/micro_step=450/global_step=10570, RunningAvgSamplesPerSec=177.07743201866066, CurrSamplesPerSec=176.83611617994848, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:50,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=10580, skipped=203, lr=[1.9284665303433496e-06, 1.9284665303433496e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:50,721] [INFO] [timer.py:199:stop] epoch=11/micro_step=460/global_step=10580, RunningAvgSamplesPerSec=177.0772798618765, CurrSamplesPerSec=176.75972939052286, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:54,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=10590, skipped=203, lr=[1.920237445207801e-06, 1.920237445207801e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:54,343] [INFO] [timer.py:199:stop] epoch=11/micro_step=470/global_step=10590, RunningAvgSamplesPerSec=177.07715395207515, CurrSamplesPerSec=177.0598335699096, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:52:57,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=10600, skipped=203, lr=[1.9120215911310065e-06, 1.9120215911310065e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:52:57,967] [INFO] [timer.py:199:stop] epoch=11/micro_step=480/global_step=10600, RunningAvgSamplesPerSec=177.07687141500767, CurrSamplesPerSec=176.92877711830056, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:01,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=10610, skipped=203, lr=[1.903819005535802e-06, 1.903819005535802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:01,589] [INFO] [timer.py:199:stop] epoch=11/micro_step=490/global_step=10610, RunningAvgSamplesPerSec=177.07672167148968, CurrSamplesPerSec=176.9142013550207, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:05,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=10620, skipped=203, lr=[1.8956297257845855e-06, 1.8956297257845855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:05,211] [INFO] [timer.py:199:stop] epoch=11/micro_step=500/global_step=10620, RunningAvgSamplesPerSec=177.0765973810035, CurrSamplesPerSec=176.83996055224225, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:08,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=10630, skipped=203, lr=[1.8874537891791408e-06, 1.8874537891791408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:08,832] [INFO] [timer.py:199:stop] epoch=11/micro_step=510/global_step=10630, RunningAvgSamplesPerSec=177.0764997465317, CurrSamplesPerSec=177.01196321486177, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:09,889] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:53:10,223] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:53:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=10640, skipped=205, lr=[1.8809226719877523e-06, 1.8809226719877523e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:12,398] [INFO] [timer.py:199:stop] epoch=11/micro_step=520/global_step=10640, RunningAvgSamplesPerSec=177.07889291333643, CurrSamplesPerSec=176.79255830617691, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:16,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=10650, skipped=205, lr=[1.8727708468511072e-06, 1.8727708468511072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:16,049] [INFO] [timer.py:199:stop] epoch=11/micro_step=530/global_step=10650, RunningAvgSamplesPerSec=177.07740566013928, CurrSamplesPerSec=176.65236392687663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:19,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=10660, skipped=205, lr=[1.8646324689813683e-06, 1.8646324689813683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:19,673] [INFO] [timer.py:199:stop] epoch=11/micro_step=540/global_step=10660, RunningAvgSamplesPerSec=177.07713758479008, CurrSamplesPerSec=176.8643121255966, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:23,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=10670, skipped=205, lr=[1.8565075754484762e-06, 1.8565075754484762e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:23,295] [INFO] [timer.py:199:stop] epoch=11/micro_step=550/global_step=10670, RunningAvgSamplesPerSec=177.07700571211723, CurrSamplesPerSec=176.88762134706778, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:53:26,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=10680, skipped=205, lr=[1.8483962032609385e-06, 1.8483962032609385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:26,922] [INFO] [timer.py:199:stop] epoch=11/micro_step=560/global_step=10680, RunningAvgSamplesPerSec=177.07662604680476, CurrSamplesPerSec=176.91408475859757, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:30,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=10690, skipped=205, lr=[1.840298389365682e-06, 1.840298389365682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:30,543] [INFO] [timer.py:199:stop] epoch=11/micro_step=570/global_step=10690, RunningAvgSamplesPerSec=177.0764995232887, CurrSamplesPerSec=176.8223710039075, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:34,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=10700, skipped=205, lr=[1.8322141706478747e-06, 1.8322141706478747e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:34,164] [INFO] [timer.py:199:stop] epoch=11/micro_step=580/global_step=10700, RunningAvgSamplesPerSec=177.07637795345332, CurrSamplesPerSec=176.9364740902059, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=10710, skipped=205, lr=[1.8241435839307546e-06, 1.8241435839307546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:37,787] [INFO] [timer.py:199:stop] epoch=11/micro_step=590/global_step=10710, RunningAvgSamplesPerSec=177.07619085331575, CurrSamplesPerSec=177.18441775148085, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:41,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=10720, skipped=205, lr=[1.8160866659754722e-06, 1.8160866659754722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:41,408] [INFO] [timer.py:199:stop] epoch=11/micro_step=600/global_step=10720, RunningAvgSamplesPerSec=177.0760667907645, CurrSamplesPerSec=176.8462517038309, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:53:45,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=10730, skipped=205, lr=[1.8080434534809147e-06, 1.8080434534809147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:45,029] [INFO] [timer.py:199:stop] epoch=11/micro_step=610/global_step=10730, RunningAvgSamplesPerSec=177.0759855112002, CurrSamplesPerSec=177.022118907013, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:46,816] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:53:47,150] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:53:48,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=10740, skipped=207, lr=[1.8016187760387939e-06, 1.8016187760387939e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:48,598] [INFO] [timer.py:199:stop] epoch=11/micro_step=620/global_step=10740, RunningAvgSamplesPerSec=177.07821997230425, CurrSamplesPerSec=176.84438761225073, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:52,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=10750, skipped=207, lr=[1.7936003256553626e-06, 1.7936003256553626e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:52,221] [INFO] [timer.py:199:stop] epoch=11/micro_step=630/global_step=10750, RunningAvgSamplesPerSec=177.07803185076497, CurrSamplesPerSec=177.05259297661024, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:55,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=10760, skipped=207, lr=[1.7855956831568942e-06, 1.7855956831568942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:55,843] [INFO] [timer.py:199:stop] epoch=11/micro_step=640/global_step=10760, RunningAvgSamplesPerSec=177.07790499376787, CurrSamplesPerSec=176.89240049634037, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:53:59,445] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=10770, skipped=207, lr=[1.777604885004165e-06, 1.777604885004165e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:53:59,464] [INFO] [timer.py:199:stop] epoch=11/micro_step=650/global_step=10770, RunningAvgSamplesPerSec=177.07778867825476, CurrSamplesPerSec=176.88342523379265, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:03,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=10780, skipped=207, lr=[1.7696279675948878e-06, 1.7696279675948878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:03,086] [INFO] [timer.py:199:stop] epoch=11/micro_step=660/global_step=10780, RunningAvgSamplesPerSec=177.0776196133932, CurrSamplesPerSec=177.10410749406378, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:06,686] [INFO] [logging.py:96:log_dist] [Rank 0] step=10790, skipped=207, lr=[1.7616649672635525e-06, 1.7616649672635525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:06,705] [INFO] [timer.py:199:stop] epoch=11/micro_step=670/global_step=10790, RunningAvgSamplesPerSec=177.07760797578516, CurrSamplesPerSec=177.05282653489274, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:10,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=10800, skipped=207, lr=[1.753715920281256e-06, 1.753715920281256e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:10,327] [INFO] [timer.py:199:stop] epoch=11/micro_step=680/global_step=10800, RunningAvgSamplesPerSec=177.07746642571138, CurrSamplesPerSec=177.1219868825633, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:13,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=10810, skipped=207, lr=[1.7457808628555402e-06, 1.7457808628555402e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:13,948] [INFO] [timer.py:199:stop] epoch=11/micro_step=690/global_step=10810, RunningAvgSamplesPerSec=177.07734190082934, CurrSamplesPerSec=176.7596129976222, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:17,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=10820, 
skipped=207, lr=[1.7378598311302241e-06, 1.7378598311302241e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:17,595] [INFO] [timer.py:199:stop] epoch=11/micro_step=700/global_step=10820, RunningAvgSamplesPerSec=177.07604726418467, CurrSamplesPerSec=175.7372302914208, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:21,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=10830, skipped=207, lr=[1.7299528611852372e-06, 1.7299528611852372e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:21,236] [INFO] [timer.py:199:stop] epoch=11/micro_step=710/global_step=10830, RunningAvgSamplesPerSec=177.07505421094342, CurrSamplesPerSec=176.52643699778977, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:23,741] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:54:24,075] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:54:24,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=10840, skipped=209, lr=[1.7236374339159133e-06, 1.7236374339159133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:24,799] [INFO] [timer.py:199:stop] epoch=11/micro_step=720/global_step=10840, RunningAvgSamplesPerSec=177.07757067431825, CurrSamplesPerSec=177.26842978350223, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:28,400] [INFO] [logging.py:96:log_dist] [Rank 0] step=10850, skipped=209, lr=[1.7157558658923977e-06, 1.7157558658923977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:28,419] [INFO] [timer.py:199:stop] epoch=11/micro_step=730/global_step=10850, RunningAvgSamplesPerSec=177.07747727666896, CurrSamplesPerSec=176.92784419751345, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:32,020] [INFO] [logging.py:96:log_dist] [Rank 0] step=10860, skipped=209, lr=[1.7078884603317481e-06, 1.7078884603317481e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:32,038] [INFO] [timer.py:199:stop] epoch=11/micro_step=740/global_step=10860, RunningAvgSamplesPerSec=177.07744808352004, CurrSamplesPerSec=177.00332595697094, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:35,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=10870, skipped=209, lr=[1.7000352530696334e-06, 1.7000352530696334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:35,660] [INFO] [timer.py:199:stop] epoch=11/micro_step=750/global_step=10870, RunningAvgSamplesPerSec=177.0773160191353, CurrSamplesPerSec=176.5955351703685, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:39,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=10880, skipped=209, lr=[1.6921962798770486e-06, 1.6921962798770486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:39,280] [INFO] [timer.py:199:stop] epoch=11/micro_step=760/global_step=10880, RunningAvgSamplesPerSec=177.0772367511035, CurrSamplesPerSec=177.15459976862073, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:42,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=10890, skipped=209, lr=[1.6843715764601531e-06, 1.6843715764601531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:42,902] [INFO] [timer.py:199:stop] epoch=11/micro_step=770/global_step=10890, RunningAvgSamplesPerSec=177.07711488660317, CurrSamplesPerSec=176.78021692888237, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:46,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=10900, skipped=209, lr=[1.6765611784601104e-06, 1.6765611784601104e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:46,523] [INFO] [timer.py:199:stop] epoch=11/micro_step=780/global_step=10900, RunningAvgSamplesPerSec=177.07704711755764, CurrSamplesPerSec=176.98908799362817, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:50,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=10910, skipped=209, lr=[1.6687651214529172e-06, 1.6687651214529172e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:50,159] [INFO] [timer.py:199:stop] epoch=11/micro_step=790/global_step=10910, RunningAvgSamplesPerSec=177.07626192452605, CurrSamplesPerSec=169.2987052630185, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:53,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=10920, skipped=209, lr=[1.6609834409492537e-06, 1.6609834409492537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:53,779] [INFO] [timer.py:199:stop] epoch=11/micro_step=800/global_step=10920, RunningAvgSamplesPerSec=177.07618425566835, CurrSamplesPerSec=176.96575202488268, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:54:57,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=10930, skipped=209, lr=[1.6532161723943139e-06, 1.6532161723943139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:54:57,400] [INFO] [timer.py:199:stop] epoch=11/micro_step=810/global_step=10930, RunningAvgSamplesPerSec=177.07607817856032, CurrSamplesPerSec=176.91513413193857, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:00,629] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:55:00,962] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:55:00,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=10940, skipped=211, lr=[1.6470127579307093e-06, 1.6470127579307093e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:00,963] [INFO] [timer.py:199:stop] epoch=11/micro_step=820/global_step=10940, RunningAvgSamplesPerSec=177.07858939555754, CurrSamplesPerSec=192.26036235288726, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:04,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=10950, skipped=211, lr=[1.6392715199957165e-06, 1.6392715199957165e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:04,584] [INFO] [timer.py:199:stop] epoch=11/micro_step=830/global_step=10950, RunningAvgSamplesPerSec=177.07845938598527, CurrSamplesPerSec=176.8718869487238, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:08,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=10960, skipped=211, lr=[1.6315447929062404e-06, 1.6315447929062404e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:08,207] [INFO] [timer.py:199:stop] epoch=11/micro_step=840/global_step=10960, RunningAvgSamplesPerSec=177.07828417006527, CurrSamplesPerSec=176.72947239658018, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:11,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=10970, skipped=211, lr=[1.623832611857166e-06, 1.623832611857166e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:11,828] [INFO] [timer.py:199:stop] epoch=11/micro_step=850/global_step=10970, RunningAvgSamplesPerSec=177.0781674034305, CurrSamplesPerSec=177.02387000619896, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:55:15,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=10980, skipped=211, lr=[1.6161350119771176e-06, 1.6161350119771176e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:15,449] [INFO] [timer.py:199:stop] epoch=11/micro_step=860/global_step=10980, RunningAvgSamplesPerSec=177.0780339417565, CurrSamplesPerSec=176.91501753428594, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:19,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=10990, skipped=211, lr=[1.608452028328307e-06, 1.608452028328307e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:19,070] [INFO] [timer.py:199:stop] epoch=11/micro_step=870/global_step=10990, RunningAvgSamplesPerSec=177.07794376774683, CurrSamplesPerSec=177.15483359610468, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:22,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=11000, skipped=211, lr=[1.6007836959063693e-06, 1.6007836959063693e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:22,717] [INFO] [timer.py:199:stop] epoch=11/micro_step=880/global_step=11000, RunningAvgSamplesPerSec=177.07686050822826, CurrSamplesPerSec=176.87841347050363, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:26,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=11010, skipped=211, lr=[1.5931300496402022e-06, 1.5931300496402022e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:26,355] [INFO] [timer.py:199:stop] epoch=11/micro_step=890/global_step=11010, RunningAvgSamplesPerSec=177.07599771771478, CurrSamplesPerSec=177.00752776092634, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:29,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=11020, skipped=211, lr=[1.5854911243918115e-06, 1.5854911243918115e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:29,975] [INFO] [timer.py:199:stop] epoch=11/micro_step=900/global_step=11020, RunningAvgSamplesPerSec=177.07593038693383, CurrSamplesPerSec=176.98465368599, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:55:33,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=11030, skipped=211, lr=[1.5778669549561445e-06, 1.5778669549561445e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:33,596] [INFO] [timer.py:199:stop] epoch=11/micro_step=910/global_step=11030, RunningAvgSamplesPerSec=177.07580019770467, CurrSamplesPerSec=176.32444667776758, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:37,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=11040, skipped=211, lr=[1.5702575760609407e-06, 1.5702575760609407e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:37,215] [INFO] [timer.py:199:stop] epoch=11/micro_step=920/global_step=11040, RunningAvgSamplesPerSec=177.07579371045762, CurrSamplesPerSec=176.95641936032507, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 12/16 *****
ppl: 1.7724738121032715
Beginning of Epoch 13/16, Total Micro Batches 920
[2023-04-21 22:55:45,711] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:55:46,044] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step.
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:55:48,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=11050, skipped=213, lr=[1.5641807454279474e-06, 1.5641807454279474e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:48,943] [INFO] [timer.py:199:stop] epoch=12/micro_step=10/global_step=11050, RunningAvgSamplesPerSec=177.0768826805045, CurrSamplesPerSec=176.79022961253543, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:52,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=11060, skipped=213, lr=[1.5565980768043318e-06, 1.5565980768043318e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:52,573] [INFO] [timer.py:199:stop] epoch=12/micro_step=20/global_step=11060, RunningAvgSamplesPerSec=177.07638995136736, CurrSamplesPerSec=177.02200216829905, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:56,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=11070, skipped=213, lr=[1.5490302955999337e-06, 1.5490302955999337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:56,194] [INFO] [timer.py:199:stop] epoch=12/micro_step=30/global_step=11070, RunningAvgSamplesPerSec=177.07630987376112, CurrSamplesPerSec=176.72749441545093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:55:59,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=11080, skipped=213, lr=[1.5414774362856452e-06, 1.5414774362856452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:55:59,823] [INFO] [timer.py:199:stop] epoch=12/micro_step=40/global_step=11080, RunningAvgSamplesPerSec=177.07599893575497, CurrSamplesPerSec=177.1589256769956, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:03,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=11090, skipped=213, lr=[1.5339395332643926e-06, 1.5339395332643926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:03,444] [INFO] [timer.py:199:stop] epoch=12/micro_step=50/global_step=11090, RunningAvgSamplesPerSec=177.07587145390758, CurrSamplesPerSec=177.1284149958594, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:07,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=11100, skipped=213, lr=[1.5264166208709704e-06, 1.5264166208709704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:07,068] [INFO] [timer.py:199:stop] epoch=12/micro_step=60/global_step=11100, RunningAvgSamplesPerSec=177.07565152696236, CurrSamplesPerSec=176.86303029664128, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:10,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=11110, skipped=213, lr=[1.5189087333718979e-06, 1.5189087333718979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:10,691] [INFO] [timer.py:199:stop] epoch=12/micro_step=70/global_step=11110, RunningAvgSamplesPerSec=177.07546568905863, CurrSamplesPerSec=176.9519866894837, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:14,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=11120, skipped=213, lr=[1.5114159049652562e-06, 1.5114159049652562e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:14,313] [INFO] [timer.py:199:stop] epoch=12/micro_step=80/global_step=11120, RunningAvgSamplesPerSec=177.07532114987367, CurrSamplesPerSec=176.4171517388004, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:17,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=11130, skipped=213, lr=[1.5039381697805262e-06, 1.5039381697805262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:17,936] [INFO] [timer.py:199:stop] epoch=12/micro_step=90/global_step=11130, RunningAvgSamplesPerSec=177.07513132494412, CurrSamplesPerSec=176.96481871412882, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:21,538] [INFO] [logging.py:96:log_dist] [Rank 0] step=11140, skipped=213, lr=[1.4964755618784517e-06, 1.4964755618784517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:21,557] [INFO] [timer.py:199:stop] epoch=12/micro_step=100/global_step=11140, RunningAvgSamplesPerSec=177.07502646401326, CurrSamplesPerSec=177.01546504604167, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:22,613] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:56:22,946] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:56:25,102] [INFO] [logging.py:96:log_dist] [Rank 0] step=11150, skipped=215, lr=[1.4905163900451315e-06, 1.4905163900451315e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:25,121] [INFO] [timer.py:199:stop] epoch=12/micro_step=110/global_step=11150, RunningAvgSamplesPerSec=177.07743315610426, CurrSamplesPerSec=177.03729625098515, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:28,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=11160, skipped=215, lr=[1.4830810968648446e-06, 1.4830810968648446e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:28,750] [INFO] [timer.py:199:stop] epoch=12/micro_step=120/global_step=11160, RunningAvgSamplesPerSec=177.07730806552001, CurrSamplesPerSec=176.83471826782838, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:32,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=11170, skipped=215, lr=[1.475661025970213e-06, 1.475661025970213e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:32,393] [INFO] [timer.py:199:stop] epoch=12/micro_step=130/global_step=11170, RunningAvgSamplesPerSec=177.07640311392865, CurrSamplesPerSec=177.0368292169793, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:35,995] [INFO] [logging.py:96:log_dist] [Rank 0] step=11180, skipped=215, lr=[1.4682562111593107e-06, 1.4682562111593107e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:36,013] [INFO] [timer.py:199:stop] epoch=12/micro_step=140/global_step=11180, RunningAvgSamplesPerSec=177.07629626631785, CurrSamplesPerSec=176.89496502436597, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:56:39,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=11190, skipped=215, lr=[1.4608666861607276e-06, 1.4608666861607276e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:39,632] [INFO] [timer.py:199:stop] epoch=12/micro_step=150/global_step=11190, RunningAvgSamplesPerSec=177.07627334398146, CurrSamplesPerSec=176.98488706506302, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:43,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=11200, skipped=215, lr=[1.4534924846334072e-06, 1.4534924846334072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:43,253] [INFO] [timer.py:199:stop] epoch=12/micro_step=160/global_step=11200, RunningAvgSamplesPerSec=177.07615578775795, CurrSamplesPerSec=177.00706088396888, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:46,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=11210, skipped=215, lr=[1.446133640166498e-06, 1.446133640166498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:46,873] [INFO] [timer.py:199:stop] epoch=12/micro_step=170/global_step=11210, RunningAvgSamplesPerSec=177.07610203660585, CurrSamplesPerSec=176.86384600382672, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:50,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=11220, skipped=215, lr=[1.4387901862791912e-06, 1.4387901862791912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:50,492] [INFO] [timer.py:199:stop] epoch=12/micro_step=180/global_step=11220, RunningAvgSamplesPerSec=177.07604432035444, CurrSamplesPerSec=177.10843093683528, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:54,094] [INFO] [logging.py:96:log_dist] [Rank 0] step=11230, skipped=215, lr=[1.431462156420581e-06, 1.431462156420581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:54,113] [INFO] [timer.py:199:stop] epoch=12/micro_step=190/global_step=11230, RunningAvgSamplesPerSec=177.075964662057, CurrSamplesPerSec=176.98990486400808, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:56:57,714] [INFO] [logging.py:96:log_dist] [Rank 0] step=11240, skipped=215, lr=[1.4241495839695046e-06, 1.4241495839695046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:56:57,732] [INFO] [timer.py:199:stop] epoch=12/micro_step=200/global_step=11240, RunningAvgSamplesPerSec=177.0759189578479, CurrSamplesPerSec=176.8293598094395, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:56:59,513] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:56:59,846] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:57:01,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=11250, skipped=217, lr=[1.4183106777276984e-06, 1.4183106777276984e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:01,297] [INFO] [timer.py:199:stop] epoch=12/micro_step=210/global_step=11250, RunningAvgSamplesPerSec=177.07828355435234, CurrSamplesPerSec=176.83658215556665, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:04,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=11260, skipped=217, lr=[1.4110260125000082e-06, 1.4110260125000082e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:04,937] [INFO] [timer.py:199:stop] epoch=12/micro_step=220/global_step=11260, RunningAvgSamplesPerSec=177.07733972198136, CurrSamplesPerSec=176.8878544703093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:08,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=11270, skipped=217, lr=[1.4037568977655397e-06, 1.4037568977655397e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:08,562] [INFO] [timer.py:199:stop] epoch=12/micro_step=230/global_step=11270, RunningAvgSamplesPerSec=177.07705059797146, CurrSamplesPerSec=176.67026849149178, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:12,166] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=11280, skipped=217, lr=[1.39650336663477e-06, 1.39650336663477e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:12,184] [INFO] [timer.py:199:stop] epoch=12/micro_step=240/global_step=11280, RunningAvgSamplesPerSec=177.07690201992145, CurrSamplesPerSec=176.7448323450362, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:15,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=11290, skipped=217, lr=[1.3892654521472019e-06, 1.3892654521472019e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:15,807] [INFO] [timer.py:199:stop] epoch=12/micro_step=250/global_step=11290, RunningAvgSamplesPerSec=177.0767179522298, CurrSamplesPerSec=176.57311588678448, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:19,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=11300, skipped=217, lr=[1.3820431872711964e-06, 1.3820431872711964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:19,429] [INFO] [timer.py:199:stop] epoch=12/micro_step=260/global_step=11300, RunningAvgSamplesPerSec=177.07657609425686, CurrSamplesPerSec=177.07092917636612, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:23,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=11310, skipped=217, lr=[1.3748366049038366e-06, 1.3748366049038366e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:23,050] [INFO] [timer.py:199:stop] epoch=12/micro_step=270/global_step=11310, RunningAvgSamplesPerSec=177.07645877299282, CurrSamplesPerSec=176.73424300382194, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:26,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=11320, skipped=217, lr=[1.3676457378707728e-06, 1.3676457378707728e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:26,670] [INFO] [timer.py:199:stop] epoch=12/micro_step=280/global_step=11320, RunningAvgSamplesPerSec=177.0764175787161, CurrSamplesPerSec=177.12455807189357, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:30,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=11330, 
skipped=217, lr=[1.360470618926066e-06, 1.360470618926066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:30,291] [INFO] [timer.py:199:stop] epoch=12/micro_step=290/global_step=11330, RunningAvgSamplesPerSec=177.07627913589596, CurrSamplesPerSec=176.7954692595, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:33,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=11340, skipped=217, lr=[1.3533112807520563e-06, 1.3533112807520563e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:33,913] [INFO] [timer.py:199:stop] epoch=12/micro_step=300/global_step=11340, RunningAvgSamplesPerSec=177.0761136461119, CurrSamplesPerSec=177.01826661074577, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:36,425] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:57:36,759] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:57:37,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=11350, skipped=219, lr=[1.3475951942843915e-06, 1.3475951942843915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:37,484] [INFO] [timer.py:199:stop] epoch=12/micro_step=310/global_step=11350, RunningAvgSamplesPerSec=177.07823707704642, CurrSamplesPerSec=176.92982666594602, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:41,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=11360, skipped=219, lr=[1.3404643436275725e-06, 1.3404643436275725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:41,105] [INFO] [timer.py:199:stop] epoch=12/micro_step=320/global_step=11360, RunningAvgSamplesPerSec=177.07812730275126, CurrSamplesPerSec=176.99142192900442, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:44,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=11370, skipped=219, lr=[1.3333493648691072e-06, 1.3333493648691072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:44,727] [INFO] [timer.py:199:stop] epoch=12/micro_step=330/global_step=11370, RunningAvgSamplesPerSec=177.07799626636066, CurrSamplesPerSec=177.1501571637017, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:48,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=11380, skipped=219, lr=[1.3262502904173892e-06, 1.3262502904173892e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:48,348] [INFO] [timer.py:199:stop] epoch=12/micro_step=340/global_step=11380, RunningAvgSamplesPerSec=177.07786804765644, CurrSamplesPerSec=176.80594948522938, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:51,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=11390, skipped=219, lr=[1.3191671526083756e-06, 1.3191671526083756e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:51,971] [INFO] [timer.py:199:stop] epoch=12/micro_step=350/global_step=11390, RunningAvgSamplesPerSec=177.0777077637839, CurrSamplesPerSec=176.9748523206751, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:55,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=11400, skipped=219, lr=[1.31209998370543e-06, 1.31209998370543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:55,595] [INFO] [timer.py:199:stop] epoch=12/micro_step=360/global_step=11400, RunningAvgSamplesPerSec=177.07745653998458, CurrSamplesPerSec=176.87608251457684, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:57:59,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=11410, skipped=219, lr=[1.3050488158991807e-06, 1.3050488158991807e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:57:59,221] [INFO] [timer.py:199:stop] epoch=12/micro_step=370/global_step=11410, RunningAvgSamplesPerSec=177.07715154978635, CurrSamplesPerSec=176.49997337069763, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:02,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=11420, skipped=219, lr=[1.2980136813073676e-06, 1.2980136813073676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:02,844] [INFO] [timer.py:199:stop] epoch=12/micro_step=380/global_step=11420, RunningAvgSamplesPerSec=177.07695478037408, CurrSamplesPerSec=176.8673419769892, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:06,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=11430, skipped=219, lr=[1.2909946119747033e-06, 1.2909946119747033e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:06,466] [INFO] [timer.py:199:stop] epoch=12/micro_step=390/global_step=11430, RunningAvgSamplesPerSec=177.07681309170727, CurrSamplesPerSec=176.9591024005643, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:10,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=11440, skipped=219, lr=[1.2839916398727251e-06, 1.2839916398727251e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:10,089] [INFO] [timer.py:199:stop] epoch=12/micro_step=400/global_step=11440, RunningAvgSamplesPerSec=177.0766103963704, CurrSamplesPerSec=176.88517359013, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:13,319] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:58:13,653] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:58:13,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=11450, skipped=221, lr=[1.2784008736353003e-06, 1.2784008736353003e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:13,653] [INFO] [timer.py:199:stop] epoch=12/micro_step=410/global_step=11450, RunningAvgSamplesPerSec=177.07894313685935, CurrSamplesPerSec=192.0640625055898, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:17,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=11460, skipped=221, lr=[1.271426956882685e-06, 1.271426956882685e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:17,277] [INFO] [timer.py:199:stop] epoch=12/micro_step=420/global_step=11460, RunningAvgSamplesPerSec=177.07873582198644, CurrSamplesPerSec=176.82411815349917, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:20,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=11470, skipped=221, lr=[1.264469226490518e-06, 1.264469226490518e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:20,898] [INFO] [timer.py:199:stop] epoch=12/micro_step=430/global_step=11470, RunningAvgSamplesPerSec=177.0786148391471, CurrSamplesPerSec=176.91070352917916, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:24,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=11480, skipped=221, lr=[1.2575277141509332e-06, 1.2575277141509332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:24,520] [INFO] [timer.py:199:stop] epoch=12/micro_step=440/global_step=11480, RunningAvgSamplesPerSec=177.0784697846397, CurrSamplesPerSec=176.78568883637826, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:58:28,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=11490, skipped=221, lr=[1.2506024514822038e-06, 1.2506024514822038e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:28,142] [INFO] [timer.py:199:stop] epoch=12/micro_step=450/global_step=11490, RunningAvgSamplesPerSec=177.07831980725908, CurrSamplesPerSec=176.75530656805498, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:31,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=11500, skipped=221, lr=[1.2436934700285756e-06, 1.2436934700285756e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:31,765] [INFO] [timer.py:199:stop] epoch=12/micro_step=460/global_step=11500, RunningAvgSamplesPerSec=177.0781360671984, CurrSamplesPerSec=176.83285441938105, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:35,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=11510, skipped=221, lr=[1.2368008012601406e-06, 1.2368008012601406e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:35,387] [INFO] [timer.py:199:stop] epoch=12/micro_step=470/global_step=11510, RunningAvgSamplesPerSec=177.07800271948395, CurrSamplesPerSec=176.80560012435353, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:39,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=11520, skipped=221, lr=[1.2299244765726863e-06, 1.2299244765726863e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:39,021] [INFO] [timer.py:199:stop] epoch=12/micro_step=480/global_step=11520, RunningAvgSamplesPerSec=177.07734006532218, CurrSamplesPerSec=172.58468391085418, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:42,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=11530, skipped=221, lr=[1.223064527287551e-06, 1.223064527287551e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:42,650] [INFO] [timer.py:199:stop] epoch=12/micro_step=490/global_step=11530, RunningAvgSamplesPerSec=177.07690989756588, CurrSamplesPerSec=176.95688597017445, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:58:46,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=11540, skipped=221, lr=[1.2162209846514856e-06, 1.2162209846514856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:46,272] [INFO] [timer.py:199:stop] epoch=12/micro_step=500/global_step=11540, RunningAvgSamplesPerSec=177.07676728028903, CurrSamplesPerSec=176.88272590093504, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:49,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=11550, skipped=221, lr=[1.2093938798365108e-06, 1.2093938798365108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:49,894] [INFO] [timer.py:199:stop] epoch=12/micro_step=510/global_step=11550, RunningAvgSamplesPerSec=177.0766311412231, CurrSamplesPerSec=176.98430361853448, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:50,226] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:58:50,559] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:58:53,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=11560, skipped=223, lr=[1.2039440521155179e-06, 1.2039440521155179e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:53,464] [INFO] [timer.py:199:stop] epoch=12/micro_step=520/global_step=11560, RunningAvgSamplesPerSec=177.07866209050417, CurrSamplesPerSec=176.94102261566553, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:58:57,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=11570, skipped=223, lr=[1.1971466136929072e-06, 1.1971466136929072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:58:57,086] [INFO] [timer.py:199:stop] epoch=12/micro_step=530/global_step=11570, RunningAvgSamplesPerSec=177.07852458630256, CurrSamplesPerSec=177.07583504569786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:00,691] [INFO] [logging.py:96:log_dist] [Rank 0] step=11580, skipped=223, lr=[1.1903656999742624e-06, 1.1903656999742624e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:00,709] [INFO] [timer.py:199:stop] epoch=12/micro_step=540/global_step=11580, RunningAvgSamplesPerSec=177.07832795487639, CurrSamplesPerSec=176.92784419751345, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:04,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=11590, skipped=223, lr=[1.1836013418463317e-06, 1.1836013418463317e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:04,328] [INFO] [timer.py:199:stop] epoch=12/micro_step=550/global_step=11590, RunningAvgSamplesPerSec=177.07831982581956, CurrSamplesPerSec=177.19342358618786, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:07,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=11600, skipped=223, lr=[1.1768535701204485e-06, 1.1768535701204485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:07,949] [INFO] [timer.py:199:stop] epoch=12/micro_step=560/global_step=11600, RunningAvgSamplesPerSec=177.07822313928014, CurrSamplesPerSec=176.9025424734088, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:11,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=11610, skipped=223, lr=[1.1701224155324067e-06, 1.1701224155324067e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:11,569] [INFO] [timer.py:199:stop] epoch=12/micro_step=570/global_step=11610, RunningAvgSamplesPerSec=177.0781605019564, CurrSamplesPerSec=176.97006871517846, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:15,199] [INFO] [logging.py:96:log_dist] [Rank 0] step=11620, skipped=223, lr=[1.1634079087422982e-06, 1.1634079087422982e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:15,218] [INFO] [timer.py:199:stop] epoch=12/micro_step=580/global_step=11620, RunningAvgSamplesPerSec=177.07690699034524, CurrSamplesPerSec=177.02795603902806, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:18,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=11630, skipped=223, lr=[1.1567100803343919e-06, 1.1567100803343919e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:18,840] [INFO] [timer.py:199:stop] epoch=12/micro_step=590/global_step=11630, RunningAvgSamplesPerSec=177.07672986758138, CurrSamplesPerSec=176.8203909427748, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:22,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=11640, skipped=223, lr=[1.1500289608169872e-06, 1.1500289608169872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:22,460] [INFO] [timer.py:199:stop] epoch=12/micro_step=600/global_step=11640, RunningAvgSamplesPerSec=177.07668077046424, CurrSamplesPerSec=177.06065109345576, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:26,062] [INFO] [logging.py:96:log_dist] [Rank 0] step=11650, skipped=223, lr=[1.14336458062227e-06, 1.14336458062227e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:26,081] [INFO] [timer.py:199:stop] epoch=12/micro_step=610/global_step=11650, RunningAvgSamplesPerSec=177.0765921355111, CurrSamplesPerSec=176.88272590093504, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:27,138] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 22:59:27,472] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 22:59:29,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=11660, skipped=225, lr=[1.1380451491805912e-06, 1.1380451491805912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:29,647] [INFO] [timer.py:199:stop] epoch=12/micro_step=620/global_step=11660, RunningAvgSamplesPerSec=177.07878384206725, CurrSamplesPerSec=177.1750620096813, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:33,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=11670, skipped=225, lr=[1.131410976212367e-06, 1.131410976212367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:33,268] [INFO] [timer.py:199:stop] epoch=12/micro_step=630/global_step=11670, RunningAvgSamplesPerSec=177.07867917807505, CurrSamplesPerSec=176.83798009715582, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:36,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=11680, skipped=225, lr=[1.1247936273708874e-06, 1.1247936273708874e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:36,890] [INFO] [timer.py:199:stop] epoch=12/micro_step=640/global_step=11680, RunningAvgSamplesPerSec=177.07854061375826, CurrSamplesPerSec=176.95408634798923, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:40,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=11690, skipped=225, lr=[1.1181931327978712e-06, 1.1181931327978712e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:40,512] [INFO] [timer.py:199:stop] epoch=12/micro_step=650/global_step=11690, RunningAvgSamplesPerSec=177.0783812888028, CurrSamplesPerSec=176.86850733208937, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
22:59:44,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=11700, skipped=225, lr=[1.1116095225582651e-06, 1.1116095225582651e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:44,138] [INFO] [timer.py:199:stop] epoch=12/micro_step=660/global_step=11700, RunningAvgSamplesPerSec=177.07809120418233, CurrSamplesPerSec=176.94603792763178, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:47,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=11710, skipped=225, lr=[1.1050428266401095e-06, 1.1050428266401095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:47,785] [INFO] [timer.py:199:stop] epoch=12/micro_step=670/global_step=11710, RunningAvgSamplesPerSec=177.07688354521147, CurrSamplesPerSec=176.4282828973269, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:51,390] [INFO] [logging.py:96:log_dist] [Rank 0] step=11720, skipped=225, lr=[1.098493074954398e-06, 1.098493074954398e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:51,408] [INFO] [timer.py:199:stop] epoch=12/micro_step=680/global_step=11720, RunningAvgSamplesPerSec=177.0767047126577, CurrSamplesPerSec=176.92376278474438, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:55,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=11730, skipped=225, lr=[1.0919602973349466e-06, 1.0919602973349466e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:55,029] [INFO] [timer.py:199:stop] epoch=12/micro_step=690/global_step=11730, RunningAvgSamplesPerSec=177.07659863430266, CurrSamplesPerSec=177.00916184967286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 22:59:58,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=11740, skipped=225, lr=[1.0854445235382546e-06, 1.0854445235382546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 22:59:58,655] [INFO] [timer.py:199:stop] epoch=12/micro_step=700/global_step=11740, RunningAvgSamplesPerSec=177.07634352439584, CurrSamplesPerSec=177.03005750078972, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:00:02,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=11750, skipped=225, lr=[1.0789457832433692e-06, 1.0789457832433692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:02,276] [INFO] [timer.py:199:stop] epoch=12/micro_step=710/global_step=11750, RunningAvgSamplesPerSec=177.07622136949752, CurrSamplesPerSec=176.88832071863578, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:04,057] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:00:04,390] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:00:05,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=11760, skipped=227, lr=[1.0737590750235544e-06, 1.0737590750235544e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:05,839] [INFO] [timer.py:199:stop] epoch=12/micro_step=720/global_step=11760, RunningAvgSamplesPerSec=177.07853948171496, CurrSamplesPerSec=177.09686327935583, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:09,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=11770, skipped=227, lr=[1.0672910695753794e-06, 1.0672910695753794e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:09,461] [INFO] [timer.py:199:stop] epoch=12/micro_step=730/global_step=11770, RunningAvgSamplesPerSec=177.07838681942454, CurrSamplesPerSec=176.7979145343729, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:13,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=11780, skipped=227, lr=[1.0608401803171498e-06, 1.0608401803171498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:13,081] [INFO] [timer.py:199:stop] epoch=12/micro_step=740/global_step=11780, RunningAvgSamplesPerSec=177.07831403742225, CurrSamplesPerSec=177.3398675279683, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:16,684] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=11790, skipped=227, lr=[1.0544064366323722e-06, 1.0544064366323722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:16,703] [INFO] [timer.py:199:stop] epoch=12/micro_step=750/global_step=11790, RunningAvgSamplesPerSec=177.078200026759, CurrSamplesPerSec=176.6927124541377, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:20,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=11800, skipped=227, lr=[1.0479898678264458e-06, 1.0479898678264458e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:20,340] [INFO] [timer.py:199:stop] epoch=12/micro_step=760/global_step=11800, RunningAvgSamplesPerSec=177.07750725725498, CurrSamplesPerSec=176.87153732722052, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:23,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=11810, skipped=227, lr=[1.0415905031265485e-06, 1.0415905031265485e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:23,960] [INFO] [timer.py:199:stop] epoch=12/micro_step=770/global_step=11810, RunningAvgSamplesPerSec=177.07743869011134, CurrSamplesPerSec=176.79314048917286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:27,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=11820, skipped=227, lr=[1.035208371681487e-06, 1.035208371681487e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:27,581] [INFO] [timer.py:199:stop] epoch=12/micro_step=780/global_step=11820, RunningAvgSamplesPerSec=177.0773173683992, CurrSamplesPerSec=176.7095782504771, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:31,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=11830, skipped=227, lr=[1.0288435025615746e-06, 1.0288435025615746e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:31,204] [INFO] [timer.py:199:stop] epoch=12/micro_step=790/global_step=11830, RunningAvgSamplesPerSec=177.07715927685598, CurrSamplesPerSec=176.83996055224225, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:34,807] [INFO] [logging.py:96:log_dist] [Rank 0] step=11840, 
skipped=227, lr=[1.0224959247584964e-06, 1.0224959247584964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:34,825] [INFO] [timer.py:199:stop] epoch=12/micro_step=800/global_step=11840, RunningAvgSamplesPerSec=177.07704582673927, CurrSamplesPerSec=176.75775073205838, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:38,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=11850, skipped=227, lr=[1.0161656671851728e-06, 1.0161656671851728e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:38,447] [INFO] [timer.py:199:stop] epoch=12/micro_step=810/global_step=11850, RunningAvgSamplesPerSec=177.07692580490675, CurrSamplesPerSec=176.96785201007344, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:40,953] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:00:41,286] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:00:41,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=11860, skipped=229, lr=[1.0111139510710595e-06, 1.0111139510710595e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:42,012] [INFO] [timer.py:199:stop] epoch=12/micro_step=820/global_step=11860, RunningAvgSamplesPerSec=177.07913527038025, CurrSamplesPerSec=176.82167215373812, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:45,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=11870, skipped=229, lr=[1.0048149425199521e-06, 1.0048149425199521e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:45,633] [INFO] [timer.py:199:stop] epoch=12/micro_step=830/global_step=11870, RunningAvgSamplesPerSec=177.07903334338184, CurrSamplesPerSec=176.97426894030545, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:49,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=11880, skipped=229, lr=[9.985333347346538e-07, 9.985333347346538e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:49,272] [INFO] [timer.py:199:stop] epoch=12/micro_step=840/global_step=11880, RunningAvgSamplesPerSec=177.0781917348384, CurrSamplesPerSec=177.06695795742908, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:52,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=11890, skipped=229, lr=[9.922691563275975e-07, 9.922691563275975e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:52,917] [INFO] [timer.py:199:stop] epoch=12/micro_step=850/global_step=11890, RunningAvgSamplesPerSec=177.0770838899648, CurrSamplesPerSec=176.96131888647756, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:00:56,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=11900, skipped=229, lr=[9.860224358318255e-07, 9.860224358318255e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:00:56,539] [INFO] [timer.py:199:stop] epoch=12/micro_step=860/global_step=11900, RunningAvgSamplesPerSec=177.07692636556516, CurrSamplesPerSec=176.91968156027295, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:00,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=11910, skipped=229, lr=[9.797932017008612e-07, 9.797932017008612e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:00,159] [INFO] [timer.py:199:stop] epoch=12/micro_step=870/global_step=11910, RunningAvgSamplesPerSec=177.0768737728632, CurrSamplesPerSec=176.9225967013962, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:03,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=11920, skipped=229, lr=[9.735814823085752e-07, 9.735814823085752e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:03,782] [INFO] [timer.py:199:stop] epoch=12/micro_step=880/global_step=11920, RunningAvgSamplesPerSec=177.07671940442034, CurrSamplesPerSec=176.88470735839397, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:07,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=11930, skipped=229, lr=[9.67387305949062e-07, 9.67387305949062e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:07,406] [INFO] [timer.py:199:stop] epoch=12/micro_step=890/global_step=11930, RunningAvgSamplesPerSec=177.07646472127803, CurrSamplesPerSec=176.75379354802175, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:11,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=11940, skipped=229, lr=[9.612107008365076e-07, 9.612107008365076e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:11,027] [INFO] [timer.py:199:stop] epoch=12/micro_step=900/global_step=11940, RunningAvgSamplesPerSec=177.07640375337218, CurrSamplesPerSec=177.0008749967855, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:14,628] [INFO] [logging.py:96:log_dist] [Rank 0] step=11950, skipped=229, lr=[9.550516951050626e-07, 9.550516951050626e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:14,646] [INFO] [timer.py:199:stop] epoch=12/micro_step=910/global_step=11950, RunningAvgSamplesPerSec=177.07636752449403, CurrSamplesPerSec=177.27849784903202, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:17,875] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:01:18,209] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:01:18,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=11960, skipped=231, lr=[9.501371809292806e-07, 9.501371809292806e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:18,209] [INFO] [timer.py:199:stop] epoch=12/micro_step=920/global_step=11960, RunningAvgSamplesPerSec=177.07863460443176, CurrSamplesPerSec=191.8655574623628, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 13/16 *****
ppl: 1.7818936109542847
Beginning of Epoch 14/16, Total Micro Batches 920
[2023-04-21 23:01:29,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=11970, skipped=231, lr=[9.44009924725925e-07, 9.44009924725925e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:29,981] [INFO] [timer.py:199:stop] epoch=13/micro_step=10/global_step=11970, RunningAvgSamplesPerSec=177.07748676398447, CurrSamplesPerSec=177.2379983995225, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:33,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=11980, skipped=231, lr=[9.379003462524131e-07, 9.379003462524131e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:33,601] [INFO] [timer.py:199:stop] epoch=13/micro_step=20/global_step=11980, RunningAvgSamplesPerSec=177.07743124082555, CurrSamplesPerSec=177.1412726222882, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:37,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=11990, skipped=231, lr=[9.318084733375972e-07, 9.318084733375972e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:37,221] [INFO] [timer.py:199:stop] epoch=13/micro_step=30/global_step=11990, 
RunningAvgSamplesPerSec=177.07738828271854, CurrSamplesPerSec=177.2827127276666, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:40,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=12000, skipped=231, lr=[9.257343337296721e-07, 9.257343337296721e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:40,841] [INFO] [timer.py:199:stop] epoch=13/micro_step=40/global_step=12000, RunningAvgSamplesPerSec=177.0773026847436, CurrSamplesPerSec=176.91676651521323, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:44,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=12010, skipped=231, lr=[9.196779550960634e-07, 9.196779550960634e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:44,467] [INFO] [timer.py:199:stop] epoch=13/micro_step=50/global_step=12010, RunningAvgSamplesPerSec=177.07701671041923, CurrSamplesPerSec=177.08775043161086, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:48,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=12020, skipped=231, lr=[9.136393650232951e-07, 9.136393650232951e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:48,091] [INFO] [timer.py:199:stop] epoch=13/micro_step=60/global_step=12020, RunningAvgSamplesPerSec=177.07678840340327, CurrSamplesPerSec=176.90603997653866, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:51,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=12030, skipped=231, lr=[9.076185910168655e-07, 9.076185910168655e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:51,718] [INFO] [timer.py:199:stop] epoch=13/micro_step=70/global_step=12030, RunningAvgSamplesPerSec=177.07642407788938, CurrSamplesPerSec=176.88191001959018, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:55,332] [INFO] [logging.py:96:log_dist] [Rank 0] step=12040, skipped=231, lr=[9.016156605011192e-07, 9.016156605011192e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:55,350] [INFO] [timer.py:199:stop] epoch=13/micro_step=80/global_step=12040, 
RunningAvgSamplesPerSec=177.07587337417544, CurrSamplesPerSec=176.7356393278349, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:01:58,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=12050, skipped=231, lr=[8.956306008191278e-07, 8.956306008191278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:01:58,990] [INFO] [timer.py:199:stop] epoch=13/micro_step=90/global_step=12050, RunningAvgSamplesPerSec=177.0751819883182, CurrSamplesPerSec=176.98535382505554, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:02,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=12060, skipped=231, lr=[8.896634392325615e-07, 8.896634392325615e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:02,611] [INFO] [timer.py:199:stop] epoch=13/micro_step=100/global_step=12060, RunningAvgSamplesPerSec=177.07507009014338, CurrSamplesPerSec=176.83297090875794, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:02,944] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:02:03,277] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:02:06,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=12070, skipped=233, lr=[8.849026148598112e-07, 8.849026148598112e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:06,174] [INFO] [timer.py:199:stop] epoch=13/micro_step=110/global_step=12070, RunningAvgSamplesPerSec=177.07730516753912, CurrSamplesPerSec=176.84730027261455, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:09,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=12080, skipped=233, lr=[8.789677382841063e-07, 8.789677382841063e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:09,797] [INFO] [timer.py:199:stop] epoch=13/micro_step=120/global_step=12080, RunningAvgSamplesPerSec=177.07714115123065, CurrSamplesPerSec=177.1097163253928, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:13,404] [INFO] [logging.py:96:log_dist] [Rank 0] step=12090, skipped=233, lr=[8.730508357023934e-07, 8.730508357023934e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:13,422] [INFO] [timer.py:199:stop] epoch=13/micro_step=130/global_step=12090, RunningAvgSamplesPerSec=177.07686161983565, CurrSamplesPerSec=176.6845714977766, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:17,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=12100, skipped=233, lr=[8.671519340658848e-07, 8.671519340658848e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:17,046] [INFO] [timer.py:199:stop] epoch=13/micro_step=140/global_step=12100, RunningAvgSamplesPerSec=177.07664705896596, CurrSamplesPerSec=176.95606940455292, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:20,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=12110, skipped=233, lr=[8.612710602438092e-07, 8.612710602438092e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:20,667] [INFO] [timer.py:199:stop] epoch=13/micro_step=150/global_step=12110, RunningAvgSamplesPerSec=177.07658222348675, CurrSamplesPerSec=177.01850007847398, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:24,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=12120, skipped=233, lr=[8.554082410232706e-07, 8.554082410232706e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:24,286] [INFO] [timer.py:199:stop] epoch=13/micro_step=160/global_step=12120, RunningAvgSamplesPerSec=177.07657319167558, CurrSamplesPerSec=176.97671916370425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:27,886] [INFO] [logging.py:96:log_dist] [Rank 0] step=12130, skipped=233, lr=[8.495635031091402e-07, 8.495635031091402e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:27,905] [INFO] [timer.py:199:stop] epoch=13/micro_step=170/global_step=12130, RunningAvgSamplesPerSec=177.0765332191825, CurrSamplesPerSec=176.94230557538532, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:31,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=12140, skipped=233, lr=[8.437368731239274e-07, 8.437368731239274e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:31,535] [INFO] [timer.py:199:stop] epoch=13/micro_step=180/global_step=12140, RunningAvgSamplesPerSec=177.07608670935488, CurrSamplesPerSec=172.88423889136845, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:35,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=12150, skipped=233, lr=[8.37928377607662e-07, 8.37928377607662e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:35,157] [INFO] [timer.py:199:stop] epoch=13/micro_step=190/global_step=12150, RunningAvgSamplesPerSec=177.0759210183335, CurrSamplesPerSec=177.04593682458213, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:38,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=12160, skipped=233, lr=[8.321380430177733e-07, 8.321380430177733e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:38,778] [INFO] [timer.py:199:stop] epoch=13/micro_step=200/global_step=12160, RunningAvgSamplesPerSec=177.07585018733485, CurrSamplesPerSec=176.72551647859692, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:39,834] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:02:40,167] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:02:42,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=12170, skipped=235, lr=[8.27518868939437e-07, 8.27518868939437e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:42,341] [INFO] [timer.py:199:stop] epoch=13/micro_step=210/global_step=12170, RunningAvgSamplesPerSec=177.07808795118564, CurrSamplesPerSec=176.62807067874388, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:45,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=12180, skipped=235, lr=[8.217612904256142e-07, 8.217612904256142e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:45,959] [INFO] [timer.py:199:stop] epoch=13/micro_step=220/global_step=12180, RunningAvgSamplesPerSec=177.07809600982424, CurrSamplesPerSec=177.14840356520168, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:49,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=12190, skipped=235, lr=[8.160219464784996e-07, 8.160219464784996e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:49,582] [INFO] [timer.py:199:stop] epoch=13/micro_step=230/global_step=12190, RunningAvgSamplesPerSec=177.07792233835193, CurrSamplesPerSec=176.73319577529142, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:53,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=12200, skipped=235, lr=[8.103008632405379e-07, 8.103008632405379e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:53,202] [INFO] [timer.py:199:stop] epoch=13/micro_step=240/global_step=12200, RunningAvgSamplesPerSec=177.077861081389, CurrSamplesPerSec=176.9712354243815, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:02:56,805] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=12210, skipped=235, lr=[8.045980667709988e-07, 8.045980667709988e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:02:56,824] [INFO] [timer.py:199:stop] epoch=13/micro_step=250/global_step=12210, RunningAvgSamplesPerSec=177.0777415952704, CurrSamplesPerSec=177.0599503585254, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:00,427] [INFO] [logging.py:96:log_dist] [Rank 0] step=12220, skipped=235, lr=[7.989135830458554e-07, 7.989135830458554e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:00,446] [INFO] [timer.py:199:stop] epoch=13/micro_step=260/global_step=12220, RunningAvgSamplesPerSec=177.0775899803553, CurrSamplesPerSec=177.08950282949536, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:04,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=12230, skipped=235, lr=[7.932474379576686e-07, 7.932474379576686e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:04,065] [INFO] [timer.py:199:stop] epoch=13/micro_step=270/global_step=12230, RunningAvgSamplesPerSec=177.07755375377667, CurrSamplesPerSec=177.02970725369806, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:07,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=12240, skipped=235, lr=[7.875996573154646e-07, 7.875996573154646e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:07,708] [INFO] [timer.py:199:stop] epoch=13/micro_step=280/global_step=12240, RunningAvgSamplesPerSec=177.07658309639848, CurrSamplesPerSec=177.00729432213973, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:11,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=12250, skipped=235, lr=[7.819702668446232e-07, 7.819702668446232e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:11,328] [INFO] [timer.py:199:stop] epoch=13/micro_step=290/global_step=12250, RunningAvgSamplesPerSec=177.07651191098915, CurrSamplesPerSec=176.91874873540235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:14,931] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=12260, skipped=235, lr=[7.763592921867577e-07, 7.763592921867577e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:14,950] [INFO] [timer.py:199:stop] epoch=13/micro_step=300/global_step=12260, RunningAvgSamplesPerSec=177.0764145313238, CurrSamplesPerSec=176.826331259214, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:16,730] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:03:17,064] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:03:18,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=12270, skipped=237, lr=[7.718837890234216e-07, 7.718837890234216e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:18,514] [INFO] [timer.py:199:stop] epoch=13/micro_step=310/global_step=12270, RunningAvgSamplesPerSec=177.0785783725133, CurrSamplesPerSec=176.9172329159675, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:22,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=12280, skipped=237, lr=[7.663060271779448e-07, 7.663060271779448e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:22,136] [INFO] [timer.py:199:stop] epoch=13/micro_step=320/global_step=12280, RunningAvgSamplesPerSec=177.07844653810804, CurrSamplesPerSec=176.850562566129, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:25,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=12290, skipped=237, lr=[7.6074675249533e-07, 7.6074675249533e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:25,756] [INFO] [timer.py:199:stop] epoch=13/micro_step=330/global_step=12290, RunningAvgSamplesPerSec=177.07838844942307, CurrSamplesPerSec=177.23004114574255, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:29,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=12300, skipped=237, 
lr=[7.552059902978113e-07, 7.552059902978113e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:29,377] [INFO] [timer.py:199:stop] epoch=13/micro_step=340/global_step=12300, RunningAvgSamplesPerSec=177.07830809570413, CurrSamplesPerSec=177.04944000010553, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:32,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=12310, skipped=237, lr=[7.496837658233096e-07, 7.496837658233096e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:32,999] [INFO] [timer.py:199:stop] epoch=13/micro_step=350/global_step=12310, RunningAvgSamplesPerSec=177.078140376958, CurrSamplesPerSec=176.92376278474438, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:36,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=12320, skipped=237, lr=[7.441801042252967e-07, 7.441801042252967e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:36,622] [INFO] [timer.py:199:stop] epoch=13/micro_step=360/global_step=12320, RunningAvgSamplesPerSec=177.0779685875973, CurrSamplesPerSec=176.66654776375879, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:40,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=12330, skipped=237, lr=[7.386950305726977e-07, 7.386950305726977e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:40,253] [INFO] [timer.py:199:stop] epoch=13/micro_step=370/global_step=12330, RunningAvgSamplesPerSec=177.07747687700726, CurrSamplesPerSec=173.02174797206493, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:43,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=12340, skipped=237, lr=[7.332285698497683e-07, 7.332285698497683e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:43,874] [INFO] [timer.py:199:stop] epoch=13/micro_step=380/global_step=12340, RunningAvgSamplesPerSec=177.07740110221667, CurrSamplesPerSec=177.01978416198733, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:47,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=12350, skipped=237, 
lr=[7.277807469559854e-07, 7.277807469559854e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:47,496] [INFO] [timer.py:199:stop] epoch=13/micro_step=390/global_step=12350, RunningAvgSamplesPerSec=177.07725631684494, CurrSamplesPerSec=176.72912333786948, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:51,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=12360, skipped=237, lr=[7.22351586705927e-07, 7.22351586705927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:51,119] [INFO] [timer.py:199:stop] epoch=13/micro_step=400/global_step=12360, RunningAvgSamplesPerSec=177.07708547922826, CurrSamplesPerSec=176.81957563637417, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:53,625] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:03:53,959] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:03:54,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=12370, skipped=239, lr=[7.180217122304957e-07, 7.180217122304957e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:54,684] [INFO] [timer.py:199:stop] epoch=13/micro_step=410/global_step=12370, RunningAvgSamplesPerSec=177.07920128270095, CurrSamplesPerSec=176.9775359199144, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:03:58,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=12380, skipped=239, lr=[7.126262070004711e-07, 7.126262070004711e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:03:58,322] [INFO] [timer.py:199:stop] epoch=13/micro_step=420/global_step=12380, RunningAvgSamplesPerSec=177.0784308663076, CurrSamplesPerSec=177.04301761754263, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:01,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=12390, skipped=239, lr=[7.072494334423994e-07, 7.072494334423994e-07], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-21 23:04:01,967] [INFO] [timer.py:199:stop] epoch=13/micro_step=430/global_step=12390, RunningAvgSamplesPerSec=177.07742624729406, CurrSamplesPerSec=176.69410812198444, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:05,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=12400, skipped=239, lr=[7.018914160472346e-07, 7.018914160472346e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:05,589] [INFO] [timer.py:199:stop] epoch=13/micro_step=440/global_step=12400, RunningAvgSamplesPerSec=177.0773006618409, CurrSamplesPerSec=177.08179253821532, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:09,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=12410, skipped=239, lr=[6.965521792204981e-07, 6.965521792204981e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:09,209] [INFO] [timer.py:199:stop] epoch=13/micro_step=450/global_step=12410, RunningAvgSamplesPerSec=177.07724005813895, CurrSamplesPerSec=177.1228049801621, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:12,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=12420, skipped=239, lr=[6.912317472821636e-07, 6.912317472821636e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:12,848] [INFO] [timer.py:199:stop] epoch=13/micro_step=460/global_step=12420, RunningAvgSamplesPerSec=177.07644612086338, CurrSamplesPerSec=176.7610097225469, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:16,452] [INFO] [logging.py:96:log_dist] [Rank 0] step=12430, skipped=239, lr=[6.85930144466556e-07, 6.85930144466556e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:16,470] [INFO] [timer.py:199:stop] epoch=13/micro_step=470/global_step=12430, RunningAvgSamplesPerSec=177.0762930745466, CurrSamplesPerSec=176.92143063341896, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:20,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=12440, skipped=239, lr=[6.806473949222267e-07, 6.806473949222267e-07], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-21 23:04:20,091] [INFO] [timer.py:199:stop] epoch=13/micro_step=480/global_step=12440, RunningAvgSamplesPerSec=177.07620099930728, CurrSamplesPerSec=176.89123482636438, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:23,693] [INFO] [logging.py:96:log_dist] [Rank 0] step=12450, skipped=239, lr=[6.753835227118564e-07, 6.753835227118564e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:23,712] [INFO] [timer.py:199:stop] epoch=13/micro_step=490/global_step=12450, RunningAvgSamplesPerSec=177.07612600992778, CurrSamplesPerSec=177.08085800307674, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:27,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=12460, skipped=239, lr=[6.701385518121399e-07, 6.701385518121399e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:27,336] [INFO] [timer.py:199:stop] epoch=13/micro_step=500/global_step=12460, RunningAvgSamplesPerSec=177.07589662910135, CurrSamplesPerSec=177.0763022855262, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:30,567] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:04:30,901] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:04:30,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=12470, skipped=241, lr=[6.659562000934136e-07, 6.659562000934136e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:30,901] [INFO] [timer.py:199:stop] epoch=13/micro_step=510/global_step=12470, RunningAvgSamplesPerSec=177.0780070685006, CurrSamplesPerSec=191.8268926647339, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:34,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=12480, skipped=241, lr=[6.607453116992557e-07, 6.607453116992557e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:34,521] [INFO] [timer.py:199:stop] epoch=13/micro_step=520/global_step=12480, RunningAvgSamplesPerSec=177.07797067161573, CurrSamplesPerSec=177.0250374249029, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:38,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=12490, skipped=241, lr=[6.555533912921244e-07, 6.555533912921244e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:38,141] [INFO] [timer.py:199:stop] epoch=13/micro_step=530/global_step=12490, RunningAvgSamplesPerSec=177.07791403491697, CurrSamplesPerSec=177.18617206306837, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:41,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=12500, skipped=241, lr=[6.503804625209743e-07, 6.503804625209743e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:41,762] [INFO] [timer.py:199:stop] epoch=13/micro_step=540/global_step=12500, RunningAvgSamplesPerSec=177.07782966441866, CurrSamplesPerSec=176.74378499100925, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:45,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=12510, skipped=241, lr=[6.452265489482562e-07, 6.452265489482562e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:45,405] [INFO] [timer.py:199:stop] epoch=13/micro_step=550/global_step=12510, RunningAvgSamplesPerSec=177.0768402326123, CurrSamplesPerSec=176.65189892160495, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:49,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=12520, skipped=241, lr=[6.400916740498081e-07, 6.400916740498081e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:49,027] [INFO] [timer.py:199:stop] epoch=13/micro_step=560/global_step=12520, RunningAvgSamplesPerSec=177.07673688647267, CurrSamplesPerSec=177.0263216032615, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:52,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=12530, skipped=241, lr=[6.349758612147484e-07, 6.349758612147484e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:52,649] [INFO] [timer.py:199:stop] epoch=13/micro_step=570/global_step=12530, RunningAvgSamplesPerSec=177.07660316011555, CurrSamplesPerSec=176.5669602921521, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:56,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=12540, skipped=241, lr=[6.298791337453636e-07, 6.298791337453636e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:56,270] [INFO] [timer.py:199:stop] epoch=13/micro_step=580/global_step=12540, RunningAvgSamplesPerSec=177.07647922465324, CurrSamplesPerSec=176.84660122537716, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:04:59,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=12550, skipped=241, lr=[6.248015148570156e-07, 6.248015148570156e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:04:59,890] [INFO] [timer.py:199:stop] epoch=13/micro_step=590/global_step=12550, RunningAvgSamplesPerSec=177.07644396975158, CurrSamplesPerSec=176.9880377427907, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:03,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=12560, skipped=241, lr=[6.197430276780202e-07, 6.197430276780202e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:03,510] [INFO] [timer.py:199:stop] epoch=13/micro_step=600/global_step=12560, RunningAvgSamplesPerSec=177.07637450365564, CurrSamplesPerSec=176.90755560417747, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:07,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=12570, skipped=241, lr=[6.14703695249552e-07, 6.14703695249552e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:07,143] [INFO] [timer.py:199:stop] epoch=13/micro_step=610/global_step=12570, RunningAvgSamplesPerSec=177.07579721826554, CurrSamplesPerSec=176.8462517038309, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:07,476] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:05:07,810] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:05:10,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=12580, skipped=243, lr=[6.106860361551303e-07, 6.106860361551303e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:10,708] [INFO] [timer.py:199:stop] epoch=13/micro_step=620/global_step=12580, RunningAvgSamplesPerSec=177.07789909383754, CurrSamplesPerSec=177.06906034526546, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:14,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=12590, skipped=243, lr=[6.056812400628079e-07, 6.056812400628079e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:14,329] [INFO] [timer.py:199:stop] epoch=13/micro_step=630/global_step=12590, RunningAvgSamplesPerSec=177.07781582098517, CurrSamplesPerSec=176.95945236833293, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:17,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=12600, skipped=243, lr=[6.006956627718015e-07, 6.006956627718015e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:17,963] [INFO] [timer.py:199:stop] epoch=13/micro_step=640/global_step=12600, RunningAvgSamplesPerSec=177.077333991919, CurrSamplesPerSec=177.11310516648015, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:21,565] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=12610, skipped=243, lr=[5.957293269911891e-07, 5.957293269911891e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:21,584] [INFO] [timer.py:199:stop] epoch=13/micro_step=650/global_step=12610, RunningAvgSamplesPerSec=177.07726004594008, CurrSamplesPerSec=177.08985331323413, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:25,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=12620, skipped=243, lr=[5.907822553423956e-07, 5.907822553423956e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:25,204] [INFO] [timer.py:199:stop] epoch=13/micro_step=660/global_step=12620, RunningAvgSamplesPerSec=177.0772065097237, CurrSamplesPerSec=176.9459212892407, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:28,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=12630, skipped=243, lr=[5.858544703591068e-07, 5.858544703591068e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:28,824] [INFO] [timer.py:199:stop] epoch=13/micro_step=670/global_step=12630, RunningAvgSamplesPerSec=177.07714005256446, CurrSamplesPerSec=176.94358855371019, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:32,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=12640, skipped=243, lr=[5.809459944871525e-07, 5.809459944871525e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:32,445] [INFO] [timer.py:199:stop] epoch=13/micro_step=680/global_step=12640, RunningAvgSamplesPerSec=177.07706100093463, CurrSamplesPerSec=176.87340199121022, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:36,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=12650, skipped=243, lr=[5.760568500844135e-07, 5.760568500844135e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:36,064] [INFO] [timer.py:199:stop] epoch=13/micro_step=690/global_step=12650, RunningAvgSamplesPerSec=177.07706396598545, CurrSamplesPerSec=177.1968156313948, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:39,669] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=12660, skipped=243, lr=[5.71187059420716e-07, 5.71187059420716e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:39,687] [INFO] [timer.py:199:stop] epoch=13/micro_step=700/global_step=12660, RunningAvgSamplesPerSec=177.07686335124134, CurrSamplesPerSec=176.3612852700619, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:43,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=12670, skipped=243, lr=[5.663366446777296e-07, 5.663366446777296e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:43,309] [INFO] [timer.py:199:stop] epoch=13/micro_step=710/global_step=12670, RunningAvgSamplesPerSec=177.07675894218053, CurrSamplesPerSec=177.08868503949995, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:44,366] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:05:44,700] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:05:46,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=12680, skipped=245, lr=[5.624702783959953e-07, 5.624702783959953e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:46,874] [INFO] [timer.py:199:stop] epoch=13/micro_step=720/global_step=12680, RunningAvgSamplesPerSec=177.07879543387043, CurrSamplesPerSec=176.7695069414354, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:50,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=12690, skipped=245, lr=[5.576547959263226e-07, 5.576547959263226e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:50,524] [INFO] [timer.py:199:stop] epoch=13/micro_step=730/global_step=12690, RunningAvgSamplesPerSec=177.07759733490914, CurrSamplesPerSec=176.9717021123701, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:54,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=12700, skipped=245, lr=[5.528587510161932e-07, 5.528587510161932e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:54,146] [INFO] [timer.py:199:stop] epoch=13/micro_step=740/global_step=12700, RunningAvgSamplesPerSec=177.07747789325649, CurrSamplesPerSec=177.02655509223752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:05:57,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=12710, skipped=245, lr=[5.480821655113711e-07, 5.480821655113711e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:05:57,766] [INFO] [timer.py:199:stop] epoch=13/micro_step=750/global_step=12710, RunningAvgSamplesPerSec=177.07738947883547, CurrSamplesPerSec=176.9818531851207, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:01,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=12720, skipped=245, lr=[5.433250611689816e-07, 5.433250611689816e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:01,387] [INFO] [timer.py:199:stop] epoch=13/micro_step=760/global_step=12720, RunningAvgSamplesPerSec=177.0773018923967, CurrSamplesPerSec=177.04056555802293, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:04,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=12730, skipped=245, lr=[5.385874596574146e-07, 5.385874596574146e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:05,005] [INFO] [timer.py:199:stop] epoch=13/micro_step=770/global_step=12730, RunningAvgSamplesPerSec=177.07734027673214, CurrSamplesPerSec=177.21389479743905, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:08,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=12740, skipped=245, lr=[5.338693825562231e-07, 5.338693825562231e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:08,626] [INFO] [timer.py:199:stop] epoch=13/micro_step=780/global_step=12740, RunningAvgSamplesPerSec=177.07724136690908, CurrSamplesPerSec=177.0076444805505, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:12,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=12750, skipped=245, lr=[5.291708513560332e-07, 5.291708513560332e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:12,264] [INFO] [timer.py:199:stop] epoch=13/micro_step=790/global_step=12750, RunningAvgSamplesPerSec=177.07650253453974, CurrSamplesPerSec=176.91455114521224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:15,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=12760, skipped=245, lr=[5.244918874584335e-07, 5.244918874584335e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:15,885] [INFO] [timer.py:199:stop] epoch=13/micro_step=800/global_step=12760, RunningAvgSamplesPerSec=177.07638039407084, CurrSamplesPerSec=176.80525076485833, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:19,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=12770, skipped=245, lr=[5.198325121758892e-07, 5.198325121758892e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:19,505] [INFO] [timer.py:199:stop] epoch=13/micro_step=810/global_step=12770, RunningAvgSamplesPerSec=177.07635008735897, CurrSamplesPerSec=177.0409158480815, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:21,287] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:06:21,621] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:06:23,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=12780, skipped=247, lr=[5.161191300177106e-07, 5.161191300177106e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:23,070] [INFO] [timer.py:199:stop] epoch=13/micro_step=820/global_step=12780, RunningAvgSamplesPerSec=177.0783820500848, CurrSamplesPerSec=176.89683018236286, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:26,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=12790, skipped=247, lr=[5.114950676648058e-07, 5.114950676648058e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:26,692] [INFO] [timer.py:199:stop] epoch=13/micro_step=830/global_step=12790, RunningAvgSamplesPerSec=177.07827533044622, CurrSamplesPerSec=177.05773140117012, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:30,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=12800, skipped=247, lr=[5.068906531268657e-07, 5.068906531268657e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:30,311] [INFO] [timer.py:199:stop] epoch=13/micro_step=840/global_step=12800, RunningAvgSamplesPerSec=177.07825053205482, CurrSamplesPerSec=176.96738534239063, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:33,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=12810, skipped=247, lr=[5.023059073767891e-07, 5.023059073767891e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:33,929] [INFO] [timer.py:199:stop] epoch=13/micro_step=850/global_step=12810, RunningAvgSamplesPerSec=177.07826905734822, CurrSamplesPerSec=177.15214461730557, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:06:37,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=12820, skipped=247, lr=[4.977408512978771e-07, 4.977408512978771e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:37,549] [INFO] [timer.py:199:stop] epoch=13/micro_step=860/global_step=12820, RunningAvgSamplesPerSec=177.07822027107548, CurrSamplesPerSec=176.939039896329, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:41,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=12830, skipped=247, lr=[4.931955056837492e-07, 4.931955056837492e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:41,171] [INFO] [timer.py:199:stop] epoch=13/micro_step=870/global_step=12830, RunningAvgSamplesPerSec=177.07809629921292, CurrSamplesPerSec=176.9806863359156, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:44,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=12840, skipped=247, lr=[4.88669891238245e-07, 4.88669891238245e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:44,791] [INFO] [timer.py:199:stop] epoch=13/micro_step=880/global_step=12840, RunningAvgSamplesPerSec=177.07803384767857, CurrSamplesPerSec=177.02772254635647, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:48,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=12850, skipped=247, lr=[4.841640285753278e-07, 4.841640285753278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:48,411] [INFO] [timer.py:199:stop] epoch=13/micro_step=890/global_step=12850, RunningAvgSamplesPerSec=177.0779905863531, CurrSamplesPerSec=176.96960203580457, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:52,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=12860, skipped=247, lr=[4.796779382189927e-07, 4.796779382189927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:52,032] [INFO] [timer.py:199:stop] epoch=13/micro_step=900/global_step=12860, RunningAvgSamplesPerSec=177.07789533641647, CurrSamplesPerSec=176.92562855007375, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:55,638] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=12870, skipped=247, lr=[4.7521164060317327e-07, 4.7521164060317327e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:55,657] [INFO] [timer.py:199:stop] epoch=13/micro_step=910/global_step=12870, RunningAvgSamplesPerSec=177.07763976771946, CurrSamplesPerSec=176.87584942236313, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:06:58,163] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:06:58,496] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:06:59,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=12880, skipped=249, lr=[4.716528669577405e-07, 4.716528669577405e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:06:59,221] [INFO] [timer.py:199:stop] epoch=13/micro_step=920/global_step=12880, RunningAvgSamplesPerSec=177.07970652029522, CurrSamplesPerSec=176.921197421668, MemAllocated=4.98GB, MaxMemAllocated=24.08GB
***** Evaluating perplexity, Epoch 14/16 *****
ppl: 1.7721067667007446
Beginning of Epoch 15/16, Total Micro Batches 920
[2023-04-21 23:07:10,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=12890, skipped=249, lr=[4.672222474805286e-07, 4.672222474805286e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:10,999] [INFO] [timer.py:199:stop] epoch=14/micro_step=10/global_step=12890, RunningAvgSamplesPerSec=177.07826230516415, CurrSamplesPerSec=176.88156035846072, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:14,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=12900, skipped=249, lr=[4.6281147747892663e-07, 4.6281147747892663e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:14,626] [INFO] [timer.py:199:stop] epoch=14/micro_step=20/global_step=12900, RunningAvgSamplesPerSec=177.07795437194181, CurrSamplesPerSec=175.8183781914451, MemAllocated=4.98GB,
MaxMemAllocated=24.08GB [2023-04-21 23:07:18,232] [INFO] [logging.py:96:log_dist] [Rank 0] step=12910, skipped=249, lr=[4.5842057704378814e-07, 4.5842057704378814e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:18,250] [INFO] [timer.py:199:stop] epoch=14/micro_step=30/global_step=12910, RunningAvgSamplesPerSec=177.07777266097665, CurrSamplesPerSec=176.5282943924158, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:21,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=12920, skipped=249, lr=[4.540495661754586e-07, 4.540495661754586e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:21,871] [INFO] [timer.py:199:stop] epoch=14/micro_step=40/global_step=12920, RunningAvgSamplesPerSec=177.07767431491865, CurrSamplesPerSec=177.12958379386, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:25,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=12930, skipped=249, lr=[4.496984647836927e-07, 4.496984647836927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:25,492] [INFO] [timer.py:199:stop] epoch=14/micro_step=50/global_step=12930, RunningAvgSamplesPerSec=177.07757051002048, CurrSamplesPerSec=177.04278408513815, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:29,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=12940, skipped=249, lr=[4.453672926875535e-07, 4.453672926875535e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:29,119] [INFO] [timer.py:199:stop] epoch=14/micro_step=60/global_step=12940, RunningAvgSamplesPerSec=177.07730001765952, CurrSamplesPerSec=176.79174325642416, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:32,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=12950, skipped=249, lr=[4.4105606961533046e-07, 4.4105606961533046e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:32,745] [INFO] [timer.py:199:stop] epoch=14/micro_step=70/global_step=12950, RunningAvgSamplesPerSec=177.0770136329481, CurrSamplesPerSec=176.41251417061252, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 23:07:36,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=12960, skipped=249, lr=[4.367648152044436e-07, 4.367648152044436e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:36,372] [INFO] [timer.py:199:stop] epoch=14/micro_step=80/global_step=12960, RunningAvgSamplesPerSec=177.07667145829728, CurrSamplesPerSec=176.8016407975836, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:39,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=12970, skipped=249, lr=[4.324935490013594e-07, 4.324935490013594e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:39,997] [INFO] [timer.py:199:stop] epoch=14/micro_step=90/global_step=12970, RunningAvgSamplesPerSec=177.0764297248247, CurrSamplesPerSec=176.85813621153235, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:43,229] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:07:43,563] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:07:43,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=12980, skipped=251, lr=[4.290909406256114e-07, 4.290909406256114e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:43,564] [INFO] [timer.py:199:stop] epoch=14/micro_step=100/global_step=12980, RunningAvgSamplesPerSec=177.07838581346647, CurrSamplesPerSec=191.94760036410884, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:47,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=12990, skipped=251, lr=[4.2485570216298765e-07, 4.2485570216298765e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:47,185] [INFO] [timer.py:199:stop] epoch=14/micro_step=110/global_step=12990, RunningAvgSamplesPerSec=177.07828247910055, CurrSamplesPerSec=176.99865751986687, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:50,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=13000, skipped=251, lr=[4.2064050615362305e-07, 4.2064050615362305e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:50,807] [INFO] [timer.py:199:stop] epoch=14/micro_step=120/global_step=13000, RunningAvgSamplesPerSec=177.07813679545905, CurrSamplesPerSec=176.8579031666165, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:54,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=13010, skipped=251, lr=[4.164453717975404e-07, 4.164453717975404e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:54,426] [INFO] [timer.py:199:stop] epoch=14/micro_step=130/global_step=13010, RunningAvgSamplesPerSec=177.07808527640304, CurrSamplesPerSec=177.25040377908167, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:07:58,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=13020, skipped=251, lr=[4.1227031820338157e-07, 4.1227031820338157e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:07:58,090] [INFO] [timer.py:199:stop] epoch=14/micro_step=140/global_step=13020, RunningAvgSamplesPerSec=177.07647287103032, CurrSamplesPerSec=176.46864512464575, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=13030, skipped=251, lr=[4.0811536438832326e-07, 4.0811536438832326e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:01,711] [INFO] [timer.py:199:stop] epoch=14/micro_step=150/global_step=13030, RunningAvgSamplesPerSec=177.07638421772594, CurrSamplesPerSec=177.02363652430589, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:05,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=13040, skipped=251, lr=[4.03980529277985e-07, 4.03980529277985e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:05,330] [INFO] [timer.py:199:stop] epoch=14/micro_step=160/global_step=13040, RunningAvgSamplesPerSec=177.07634614986125, CurrSamplesPerSec=176.84415460356632, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:08,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=13050, skipped=251, lr=[3.998658317063522e-07, 3.998658317063522e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:08,952] [INFO] [timer.py:199:stop] epoch=14/micro_step=170/global_step=13050, RunningAvgSamplesPerSec=177.07619322806332, CurrSamplesPerSec=176.99585657580673, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:12,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=13060, skipped=251, lr=[3.957712904156798e-07, 3.957712904156798e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:12,574] [INFO] [timer.py:199:stop] epoch=14/micro_step=180/global_step=13060, RunningAvgSamplesPerSec=177.07607212720106, CurrSamplesPerSec=176.6421343763757, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:16,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=13070, skipped=251, lr=[3.916969240564129e-07, 3.916969240564129e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:16,197] [INFO] [timer.py:199:stop] epoch=14/micro_step=190/global_step=13070, RunningAvgSamplesPerSec=177.07591069311758, CurrSamplesPerSec=176.73924660032577, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:19,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=13080, skipped=251, lr=[3.87642751187103e-07, 3.87642751187103e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:19,819] [INFO] [timer.py:199:stop] epoch=14/micro_step=200/global_step=13080, RunningAvgSamplesPerSec=177.07577456872414, CurrSamplesPerSec=176.2718798777817, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:20,152] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:08:20,486] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:08:23,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=13090, skipped=253, lr=[3.844139646170508e-07, 3.844139646170508e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:23,383] [INFO] [timer.py:199:stop] epoch=14/micro_step=210/global_step=13090, RunningAvgSamplesPerSec=177.077800589776, CurrSamplesPerSec=176.83611617994848, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:26,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=13100, skipped=253, lr=[3.803961865035522e-07, 3.803961865035522e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:27,004] [INFO] [timer.py:199:stop] epoch=14/micro_step=220/global_step=13100, RunningAvgSamplesPerSec=177.07770457274145, CurrSamplesPerSec=177.03858060720697, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:30,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=13110, skipped=253, lr=[3.7639865335434606e-07, 3.7639865335434606e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:30,624] [INFO] [timer.py:199:stop] epoch=14/micro_step=230/global_step=13110, RunningAvgSamplesPerSec=177.0776188436039, CurrSamplesPerSec=176.14152453844417, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:34,257] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=13120, skipped=253, lr=[3.7242138337800797e-07, 3.7242138337800797e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:34,275] [INFO] [timer.py:199:stop] epoch=14/micro_step=240/global_step=13120, RunningAvgSamplesPerSec=177.07640810760444, CurrSamplesPerSec=176.98558720597504, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:37,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=13130, skipped=253, lr=[3.684643946908219e-07, 3.684643946908219e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:37,898] [INFO] [timer.py:199:stop] epoch=14/micro_step=250/global_step=13130, RunningAvgSamplesPerSec=177.07626112433167, CurrSamplesPerSec=176.88995260713477, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:41,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=13140, skipped=253, lr=[3.645277053166853e-07, 3.645277053166853e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:41,516] [INFO] [timer.py:199:stop] epoch=14/micro_step=260/global_step=13140, RunningAvgSamplesPerSec=177.07624712563427, CurrSamplesPerSec=177.28470315669065, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:45,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=13150, skipped=253, lr=[3.606113331870367e-07, 3.606113331870367e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:45,138] [INFO] [timer.py:199:stop] epoch=14/micro_step=270/global_step=13150, RunningAvgSamplesPerSec=177.07609473166102, CurrSamplesPerSec=177.06719155361216, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:48,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=13160, skipped=253, lr=[3.5671529614076906e-07, 3.5671529614076906e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:48,758] [INFO] [timer.py:199:stop] epoch=14/micro_step=280/global_step=13160, RunningAvgSamplesPerSec=177.07602414472413, CurrSamplesPerSec=176.88494047395477, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:52,358] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=13170, skipped=253, lr=[3.528396119241522e-07, 3.528396119241522e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:52,377] [INFO] [timer.py:199:stop] epoch=14/micro_step=290/global_step=13170, RunningAvgSamplesPerSec=177.07602504786627, CurrSamplesPerSec=176.94638784372768, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:55,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=13180, skipped=253, lr=[3.4898429819074517e-07, 3.4898429819074517e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:55,998] [INFO] [timer.py:199:stop] epoch=14/micro_step=300/global_step=13180, RunningAvgSamplesPerSec=177.07589345054157, CurrSamplesPerSec=176.8160815512442, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:08:57,056] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:08:57,390] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:08:59,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=13190, skipped=255, lr=[3.4591472575589294e-07, 3.4591472575589294e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:08:59,564] [INFO] [timer.py:199:stop] epoch=14/micro_step=310/global_step=13190, RunningAvgSamplesPerSec=177.0778822851217, CurrSamplesPerSec=177.08050755494278, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:03,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=13200, skipped=255, lr=[3.4209612308301647e-07, 3.4209612308301647e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:03,185] [INFO] [timer.py:199:stop] epoch=14/micro_step=320/global_step=13200, RunningAvgSamplesPerSec=177.07780387395192, CurrSamplesPerSec=176.85871882650898, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:06,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=13210, skipped=255, lr=[3.382979398294449e-07, 3.382979398294449e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:06,812] [INFO] [timer.py:199:stop] epoch=14/micro_step=330/global_step=13210, RunningAvgSamplesPerSec=177.07747595877305, CurrSamplesPerSec=177.23156233019478, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:10,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=13220, skipped=255, lr=[3.3452019329572675e-07, 3.3452019329572675e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:10,433] [INFO] [timer.py:199:stop] epoch=14/micro_step=340/global_step=13220, RunningAvgSamplesPerSec=177.07738868598432, CurrSamplesPerSec=177.08506348888022, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:14,035] [INFO] [logging.py:96:log_dist] [Rank 0] step=13230, skipped=255, lr=[3.30762900689323e-07, 3.30762900689323e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:14,053] [INFO] [timer.py:199:stop] epoch=14/micro_step=350/global_step=13230, RunningAvgSamplesPerSec=177.07732372778028, CurrSamplesPerSec=176.86303029664128, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:17,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=13240, skipped=255, lr=[3.2702607912452533e-07, 3.2702607912452533e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:17,673] [INFO] [timer.py:199:stop] epoch=14/micro_step=360/global_step=13240, RunningAvgSamplesPerSec=177.0772627944232, CurrSamplesPerSec=176.88400801539822, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:21,279] [INFO] [logging.py:96:log_dist] [Rank 0] step=13250, skipped=255, lr=[3.233097456223851e-07, 3.233097456223851e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:21,298] [INFO] [timer.py:199:stop] epoch=14/micro_step=370/global_step=13250, RunningAvgSamplesPerSec=177.0771467216121, CurrSamplesPerSec=176.84858187356997, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:24,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=13260, skipped=255, lr=[3.1961391711062754e-07, 3.1961391711062754e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:24,942] [INFO] [timer.py:199:stop] epoch=14/micro_step=380/global_step=13260, RunningAvgSamplesPerSec=177.07620564266145, CurrSamplesPerSec=176.64294804724773, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:28,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=13270, skipped=255, lr=[3.159386104235807e-07, 3.159386104235807e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:28,561] [INFO] [timer.py:199:stop] epoch=14/micro_step=390/global_step=13270, RunningAvgSamplesPerSec=177.07616243554136, CurrSamplesPerSec=176.99550646403227, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:32,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=13280, skipped=255, lr=[3.1228384230209573e-07, 3.1228384230209573e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:32,181] [INFO] [timer.py:199:stop] epoch=14/micro_step=400/global_step=13280, RunningAvgSamplesPerSec=177.07608636538342, CurrSamplesPerSec=176.99340582247217, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:33,964] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:09:34,297] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:09:35,728] [INFO] [logging.py:96:log_dist] [Rank 0] step=13290, skipped=257, lr=[3.0937482676224337e-07, 3.0937482676224337e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:35,746] [INFO] [timer.py:199:stop] epoch=14/micro_step=410/global_step=13290, RunningAvgSamplesPerSec=177.07804681761047, CurrSamplesPerSec=176.90988738971777, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:39,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=13300, skipped=257, lr=[3.057570699470493e-07, 3.057570699470493e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:39,383] [INFO] [timer.py:199:stop] epoch=14/micro_step=420/global_step=13300, RunningAvgSamplesPerSec=177.07738531251783, CurrSamplesPerSec=176.85941796954782, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:42,985] [INFO] [logging.py:96:log_dist] [Rank 0] step=13310, skipped=257, lr=[3.0215989807386265e-07, 3.0215989807386265e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:43,003] [INFO] [timer.py:199:stop] epoch=14/micro_step=430/global_step=13310, RunningAvgSamplesPerSec=177.07731914982844, CurrSamplesPerSec=176.9126856135074, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:46,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=13320, skipped=257, lr=[2.985833275276337e-07, 2.985833275276337e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:46,623] [INFO] [timer.py:199:stop] epoch=14/micro_step=440/global_step=13320, RunningAvgSamplesPerSec=177.0772592437542, CurrSamplesPerSec=177.0982653391769, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:09:50,228] [INFO] [logging.py:96:log_dist] [Rank 0] step=13330, skipped=257, lr=[2.9502737459947557e-07, 2.9502737459947557e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:50,246] [INFO] [timer.py:199:stop] epoch=14/micro_step=450/global_step=13330, RunningAvgSamplesPerSec=177.07707258106018, CurrSamplesPerSec=176.87037193219194, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:53,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=13340, skipped=257, lr=[2.914920554865891e-07, 2.914920554865891e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:53,866] [INFO] [timer.py:199:stop] epoch=14/micro_step=460/global_step=13340, RunningAvgSamplesPerSec=177.07702432619212, CurrSamplesPerSec=176.93250889983474, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:09:57,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=13350, skipped=257, lr=[2.879773862921899e-07, 2.879773862921899e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:09:57,486] [INFO] [timer.py:199:stop] epoch=14/micro_step=470/global_step=13350, RunningAvgSamplesPerSec=177.0769608815437, CurrSamplesPerSec=177.00367609968131, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:01,089] [INFO] [logging.py:96:log_dist] [Rank 0] step=13360, skipped=257, lr=[2.8448338302543117e-07, 2.8448338302543117e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:01,107] [INFO] [timer.py:199:stop] epoch=14/micro_step=480/global_step=13360, RunningAvgSamplesPerSec=177.07688758051938, CurrSamplesPerSec=177.0879840826583, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:04,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=13370, skipped=257, lr=[2.8101006160133776e-07, 2.8101006160133776e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:04,727] [INFO] [timer.py:199:stop] epoch=14/micro_step=490/global_step=13370, RunningAvgSamplesPerSec=177.07684060353046, CurrSamplesPerSec=176.92877711830056, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:10:08,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=13380, skipped=257, lr=[2.7755743784072665e-07, 2.7755743784072665e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:08,347] [INFO] [timer.py:199:stop] epoch=14/micro_step=500/global_step=13380, RunningAvgSamplesPerSec=177.07678920003926, CurrSamplesPerSec=177.19775139085823, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:10,854] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:10:11,188] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:10:11,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=13390, skipped=259, lr=[2.748102517213506e-07, 2.748102517213506e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:11,914] [INFO] [timer.py:199:stop] epoch=14/micro_step=510/global_step=13390, RunningAvgSamplesPerSec=177.07869271406005, CurrSamplesPerSec=176.82097330909284, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:15,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=13400, skipped=259, lr=[2.7139492332249193e-07, 2.7139492332249193e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:15,535] [INFO] [timer.py:199:stop] epoch=14/micro_step=520/global_step=13400, RunningAvgSamplesPerSec=177.07856475263523, CurrSamplesPerSec=176.82796200413685, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:19,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=13410, skipped=259, lr=[2.6800033638362473e-07, 2.6800033638362473e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:19,155] [INFO] [timer.py:199:stop] epoch=14/micro_step=530/global_step=13410, RunningAvgSamplesPerSec=177.0785191295726, CurrSamplesPerSec=176.89869537969216, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:22,755] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=13420, skipped=259, lr=[2.646265063669355e-07, 2.646265063669355e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:22,773] [INFO] [timer.py:199:stop] epoch=14/micro_step=540/global_step=13420, RunningAvgSamplesPerSec=177.07852586971276, CurrSamplesPerSec=177.04441882490747, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:26,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=13430, skipped=259, lr=[2.6127344864006343e-07, 2.6127344864006343e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:26,397] [INFO] [timer.py:199:stop] epoch=14/micro_step=550/global_step=13430, RunningAvgSamplesPerSec=177.0783438181106, CurrSamplesPerSec=177.09300772931306, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:30,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=13440, skipped=259, lr=[2.579411784760305e-07, 2.579411784760305e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:30,040] [INFO] [timer.py:199:stop] epoch=14/micro_step=560/global_step=13440, RunningAvgSamplesPerSec=177.0774955697975, CurrSamplesPerSec=176.99153862738905, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:33,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=13450, skipped=259, lr=[2.546297110531731e-07, 2.546297110531731e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:33,658] [INFO] [timer.py:199:stop] epoch=14/micro_step=570/global_step=13450, RunningAvgSamplesPerSec=177.07749625465917, CurrSamplesPerSec=177.0824934460425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:37,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=13460, skipped=259, lr=[2.5133906145507185e-07, 2.5133906145507185e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:37,277] [INFO] [timer.py:199:stop] epoch=14/micro_step=580/global_step=13460, RunningAvgSamplesPerSec=177.07745578790036, CurrSamplesPerSec=176.88424112911568, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:40,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=13470, 
skipped=259, lr=[2.480692446704834e-07, 2.480692446704834e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:40,899] [INFO] [timer.py:199:stop] epoch=14/micro_step=590/global_step=13470, RunningAvgSamplesPerSec=177.07733694881767, CurrSamplesPerSec=177.00519340076752, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:44,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=13480, skipped=259, lr=[2.4482027559327107e-07, 2.4482027559327107e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:44,546] [INFO] [timer.py:199:stop] epoch=14/micro_step=600/global_step=13480, RunningAvgSamplesPerSec=177.07631964499973, CurrSamplesPerSec=176.98290336256068, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:47,775] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:10:48,109] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:10:48,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=13490, skipped=261, lr=[2.4223612062886763e-07, 2.4223612062886763e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:48,110] [INFO] [timer.py:199:stop] epoch=14/micro_step=610/global_step=13490, RunningAvgSamplesPerSec=177.07828221914406, CurrSamplesPerSec=192.242875814185, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:51,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=13500, skipped=261, lr=[2.390247146543137e-07, 2.390247146543137e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:51,731] [INFO] [timer.py:199:stop] epoch=14/micro_step=620/global_step=13500, RunningAvgSamplesPerSec=177.07815908455777, CurrSamplesPerSec=176.91128649054764, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:55,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=13510, skipped=261, lr=[2.3583419758455043e-07, 2.3583419758455043e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:55,353] [INFO] [timer.py:199:stop] epoch=14/micro_step=630/global_step=13510, RunningAvgSamplesPerSec=177.0780313675969, CurrSamplesPerSec=176.91012057165264, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:10:58,966] [INFO] [logging.py:96:log_dist] [Rank 0] step=13520, skipped=261, lr=[2.3266458395223563e-07, 2.3266458395223563e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:10:58,984] [INFO] [timer.py:199:stop] epoch=14/micro_step=640/global_step=13520, RunningAvgSamplesPerSec=177.0775707030468, CurrSamplesPerSec=176.92516210505224, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:02,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=13530, skipped=261, lr=[2.2951588819481212e-07, 2.2951588819481212e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:02,603] [INFO] [timer.py:199:stop] epoch=14/micro_step=650/global_step=13530, RunningAvgSamplesPerSec=177.07754716807733, CurrSamplesPerSec=177.17026757492937, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:06,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=13540, skipped=261, lr=[2.263881246544442e-07, 2.263881246544442e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:06,227] [INFO] [timer.py:199:stop] epoch=14/micro_step=660/global_step=13540, RunningAvgSamplesPerSec=177.07733651712732, CurrSamplesPerSec=176.9746189680657, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:09,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=13550, skipped=261, lr=[2.2328130757794882e-07, 2.2328130757794882e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:09,848] [INFO] [timer.py:199:stop] epoch=14/micro_step=670/global_step=13550, RunningAvgSamplesPerSec=177.0772566119134, CurrSamplesPerSec=177.08529713283733, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:13,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=13560, skipped=261, lr=[2.2019545111673315e-07, 2.2019545111673315e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:13,467] [INFO] [timer.py:199:stop] epoch=14/micro_step=680/global_step=13560, RunningAvgSamplesPerSec=177.0772289926142, CurrSamplesPerSec=177.00472653612425, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:17,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=13570, skipped=261, lr=[2.1713056932673222e-07, 2.1713056932673222e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:17,106] [INFO] [timer.py:199:stop] epoch=14/micro_step=690/global_step=13570, RunningAvgSamplesPerSec=177.07647979977108, CurrSamplesPerSec=176.85382498000445, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:20,709] [INFO] [logging.py:96:log_dist] [Rank 0] step=13580, skipped=261, lr=[2.1408667616833898e-07, 2.1408667616833898e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:20,727] [INFO] [timer.py:199:stop] epoch=14/micro_step=700/global_step=13580, RunningAvgSamplesPerSec=177.07635484930088, CurrSamplesPerSec=176.95816915994865, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:24,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=13590, skipped=261, lr=[2.1106378550634563e-07, 2.1106378550634563e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:24,347] [INFO] [timer.py:199:stop] epoch=14/micro_step=710/global_step=13590, RunningAvgSamplesPerSec=177.07629241665828, CurrSamplesPerSec=177.0092785714521, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:24,680] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:11:25,014] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:11:27,894] [INFO] [logging.py:96:log_dist] [Rank 0] step=13600, skipped=263, lr=[2.0866060403022866e-07, 2.0866060403022866e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:27,913] [INFO] [timer.py:199:stop] epoch=14/micro_step=720/global_step=13600, RunningAvgSamplesPerSec=177.078192633911, CurrSamplesPerSec=176.87177040806912, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:31,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=13610, skipped=263, lr=[2.056755524956278e-07, 2.056755524956278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:31,536] [INFO] [timer.py:199:stop] epoch=14/micro_step=730/global_step=13610, RunningAvgSamplesPerSec=177.0780247043209, CurrSamplesPerSec=176.84636821085945, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:35,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=13620, skipped=263, lr=[2.027115417697058e-07, 2.027115417697058e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:35,164] [INFO] [timer.py:199:stop] epoch=14/micro_step=740/global_step=13620, RunningAvgSamplesPerSec=177.0776725050729, CurrSamplesPerSec=176.96003565102342, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:11:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=13630, skipped=263, lr=[1.997685853533947e-07, 1.997685853533947e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:38,784] [INFO] [timer.py:199:stop] epoch=14/micro_step=750/global_step=13630, RunningAvgSamplesPerSec=177.07760268784006, CurrSamplesPerSec=177.01721601358975, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:42,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=13640, skipped=263, lr=[1.968466966517227e-07, 1.968466966517227e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:42,404] [INFO] [timer.py:199:stop] epoch=14/micro_step=760/global_step=13640, RunningAvgSamplesPerSec=177.07754689146282, CurrSamplesPerSec=176.79686655114477, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:46,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=13650, skipped=263, lr=[1.9394588897375805e-07, 1.9394588897375805e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:46,026] [INFO] [timer.py:199:stop] epoch=14/micro_step=770/global_step=13650, RunningAvgSamplesPerSec=177.07742575269685, CurrSamplesPerSec=175.79350789427326, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:49,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=13660, skipped=263, lr=[1.9106617553254557e-07, 1.9106617553254557e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:49,653] [INFO] [timer.py:199:stop] epoch=14/micro_step=780/global_step=13660, RunningAvgSamplesPerSec=177.07713927292087, CurrSamplesPerSec=176.889836042672, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:53,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=13670, skipped=263, lr=[1.8820756944504756e-07, 1.8820756944504756e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:53,274] [INFO] [timer.py:199:stop] epoch=14/micro_step=790/global_step=13670, RunningAvgSamplesPerSec=177.0770267548488, CurrSamplesPerSec=176.98465368599, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:11:56,878] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=13680, skipped=263, lr=[1.853700837320792e-07, 1.853700837320792e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:11:56,897] [INFO] [timer.py:199:stop] epoch=14/micro_step=800/global_step=13680, RunningAvgSamplesPerSec=177.0768671150904, CurrSamplesPerSec=177.1097163253928, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:00,500] [INFO] [logging.py:96:log_dist] [Rank 0] step=13690, skipped=263, lr=[1.8255373131825946e-07, 1.8255373131825946e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:00,519] [INFO] [timer.py:199:stop] epoch=14/micro_step=810/global_step=13690, RunningAvgSamplesPerSec=177.0767387378153, CurrSamplesPerSec=176.77823781171654, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:01,575] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:12:01,908] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:12:04,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=13700, skipped=265, lr=[1.8031587398647956e-07, 1.8031587398647956e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:04,083] [INFO] [timer.py:199:stop] epoch=14/micro_step=820/global_step=13700, RunningAvgSamplesPerSec=177.0786710465626, CurrSamplesPerSec=176.8203909427748, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:07,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=13710, skipped=265, lr=[1.775375937738532e-07, 1.775375937738532e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:07,703] [INFO] [timer.py:199:stop] epoch=14/micro_step=830/global_step=13710, RunningAvgSamplesPerSec=177.0785907498827, CurrSamplesPerSec=177.10469173075197, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:11,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=13720, skipped=265, lr=[1.7478048253700278e-07, 1.7478048253700278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:11,322] [INFO] [timer.py:199:stop] epoch=14/micro_step=840/global_step=13720, RunningAvgSamplesPerSec=177.07857781586156, CurrSamplesPerSec=176.95490289530915, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:14,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=13730, skipped=265, lr=[1.7204455283444243e-07, 1.7204455283444243e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:14,941] [INFO] [timer.py:199:stop] epoch=14/micro_step=850/global_step=13730, RunningAvgSamplesPerSec=177.07855460464648, CurrSamplesPerSec=177.05247619770006, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:18,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=13740, skipped=265, lr=[1.6932981712820401e-07, 1.6932981712820401e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:18,579] [INFO] [timer.py:199:stop] epoch=14/micro_step=860/global_step=13740, RunningAvgSamplesPerSec=177.07785486894625, CurrSamplesPerSec=176.9873375824903, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:22,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=13750, skipped=265, lr=[1.6663628778378454e-07, 1.6663628778378454e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:22,199] [INFO] [timer.py:199:stop] epoch=14/micro_step=870/global_step=13750, RunningAvgSamplesPerSec=177.0777652807629, CurrSamplesPerSec=176.93507459095758, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:25,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=13760, skipped=265, lr=[1.6396397707008403e-07, 1.6396397707008403e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:25,820] [INFO] [timer.py:199:stop] epoch=14/micro_step=880/global_step=13760, RunningAvgSamplesPerSec=177.07767816627788, CurrSamplesPerSec=176.91688311517126, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:29,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=13770, skipped=265, lr=[1.613128971593536e-07, 1.613128971593536e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:29,441] [INFO] [timer.py:199:stop] epoch=14/micro_step=890/global_step=13770, RunningAvgSamplesPerSec=177.07761863365278, CurrSamplesPerSec=177.0638044692811, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:33,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=13780, skipped=265, lr=[1.586830601271403e-07, 1.586830601271403e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:33,062] [INFO] [timer.py:199:stop] epoch=14/micro_step=900/global_step=13780, RunningAvgSamplesPerSec=177.07753119264416, CurrSamplesPerSec=176.84648471804155, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:36,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=13790, skipped=265, lr=[1.5607447795222968e-07, 1.5607447795222968e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:36,690] [INFO] [timer.py:199:stop] epoch=14/micro_step=910/global_step=13790, RunningAvgSamplesPerSec=177.07719989411004, CurrSamplesPerSec=173.65887008551763, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:38,473] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:12:38,806] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:12:40,237] [INFO] [logging.py:96:log_dist] [Rank 0] step=13800, skipped=267, lr=[1.5400292369749732e-07, 1.5400292369749732e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:40,256] [INFO] [timer.py:199:stop] epoch=14/micro_step=920/global_step=13800, RunningAvgSamplesPerSec=177.07905498871665, CurrSamplesPerSec=176.93437484963553, MemAllocated=4.98GB, MaxMemAllocated=24.08GB ***** Evaluating perplexity, Epoch 15/16 ***** ppl: 1.7748836278915405 Beginning of Epoch 16/16, Total Micro Batches 920 [2023-04-21 23:12:52,038] [INFO] [logging.py:96:log_dist] [Rank 0] step=13810, skipped=267, lr=[1.5143263014320282e-07, 1.5143263014320282e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:52,056] [INFO] [timer.py:199:stop] epoch=15/micro_step=10/global_step=13810, RunningAvgSamplesPerSec=177.0772119602919, CurrSamplesPerSec=176.69713214462502, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:55,657] [INFO] [logging.py:96:log_dist] [Rank 0] step=13820, skipped=267, lr=[1.4888362447158386e-07, 1.4888362447158386e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:55,675] [INFO] [timer.py:199:stop] epoch=15/micro_step=20/global_step=13820, RunningAvgSamplesPerSec=177.07717646479722, CurrSamplesPerSec=176.96073559532763, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:12:59,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=13830, skipped=267, lr=[1.4635591829324226e-07, 1.4635591829324226e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:12:59,299] [INFO] [timer.py:199:stop] epoch=15/micro_step=30/global_step=13830, 
RunningAvgSamplesPerSec=177.07696286406338, CurrSamplesPerSec=177.1277137244629, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:02,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=13840, skipped=267, lr=[1.438495231217624e-07, 1.438495231217624e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:02,920] [INFO] [timer.py:199:stop] epoch=15/micro_step=40/global_step=13840, RunningAvgSamplesPerSec=177.07688273433396, CurrSamplesPerSec=177.00624385521957, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:06,527] [INFO] [logging.py:96:log_dist] [Rank 0] step=13850, skipped=267, lr=[1.4136445037365856e-07, 1.4136445037365856e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:06,546] [INFO] [timer.py:199:stop] epoch=15/micro_step=50/global_step=13850, RunningAvgSamplesPerSec=177.07661292815953, CurrSamplesPerSec=176.89578102614996, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:10,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=13860, skipped=267, lr=[1.3890071136832062e-07, 1.3890071136832062e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:10,169] [INFO] [timer.py:199:stop] epoch=15/micro_step=60/global_step=13860, RunningAvgSamplesPerSec=177.07645953868283, CurrSamplesPerSec=177.13624623701185, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:13,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=13870, skipped=267, lr=[1.3645831732796759e-07, 1.3645831732796759e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:13,788] [INFO] [timer.py:199:stop] epoch=15/micro_step=70/global_step=13870, RunningAvgSamplesPerSec=177.0764234589922, CurrSamplesPerSec=176.95863577902605, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:17,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=13880, skipped=267, lr=[1.340372793775895e-07, 1.340372793775895e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:17,411] [INFO] [timer.py:199:stop] epoch=15/micro_step=80/global_step=13880, 
RunningAvgSamplesPerSec=177.07627260740784, CurrSamplesPerSec=176.8553397130763, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:21,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=13890, skipped=267, lr=[1.3163760854490226e-07, 1.3163760854490226e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:21,030] [INFO] [timer.py:199:stop] epoch=15/micro_step=90/global_step=13890, RunningAvgSamplesPerSec=177.07623334159098, CurrSamplesPerSec=177.2323814403313, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:23,549] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:13:23,883] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:13:24,590] [INFO] [logging.py:96:log_dist] [Rank 0] step=13900, skipped=269, lr=[1.2973326355198338e-07, 1.2973326355198338e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:24,608] [INFO] [timer.py:199:stop] epoch=15/micro_step=100/global_step=13900, RunningAvgSamplesPerSec=177.07766391135243, CurrSamplesPerSec=176.7640360356089, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:28,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=13910, skipped=269, lr=[1.2737208101028665e-07, 1.2737208101028665e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:28,227] [INFO] [timer.py:199:stop] epoch=15/micro_step=110/global_step=13910, RunningAvgSamplesPerSec=177.077644009576, CurrSamplesPerSec=177.018850281221, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:31,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=13920, skipped=269, lr=[1.250322959459502e-07, 1.250322959459502e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:31,848] [INFO] [timer.py:199:stop] epoch=15/micro_step=120/global_step=13920, RunningAvgSamplesPerSec=177.07756145005484, 
CurrSamplesPerSec=176.82178862838273, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:35,449] [INFO] [logging.py:96:log_dist] [Rank 0] step=13930, skipped=269, lr=[1.227139190165874e-07, 1.227139190165874e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:35,467] [INFO] [timer.py:199:stop] epoch=15/micro_step=130/global_step=13930, RunningAvgSamplesPerSec=177.0775425226757, CurrSamplesPerSec=176.8520772433167, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:39,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=13940, skipped=269, lr=[1.2041696078229589e-07, 1.2041696078229589e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:39,088] [INFO] [timer.py:199:stop] epoch=15/micro_step=140/global_step=13940, RunningAvgSamplesPerSec=177.07745135467397, CurrSamplesPerSec=176.81619801852372, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:42,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=13950, skipped=269, lr=[1.1814143170561594e-07, 1.1814143170561594e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:42,723] [INFO] [timer.py:199:stop] epoch=15/micro_step=150/global_step=13950, RunningAvgSamplesPerSec=177.07683232164746, CurrSamplesPerSec=176.2608841822155, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:46,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=13960, skipped=269, lr=[1.1588734215147629e-07, 1.1588734215147629e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:46,345] [INFO] [timer.py:199:stop] epoch=15/micro_step=160/global_step=13960, RunningAvgSamplesPerSec=177.0767143108868, CurrSamplesPerSec=177.12105192313277, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:49,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=13970, skipped=269, lr=[1.1365470238714959e-07, 1.1365470238714959e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:49,966] [INFO] [timer.py:199:stop] epoch=15/micro_step=170/global_step=13970, RunningAvgSamplesPerSec=177.07659978926114, 
CurrSamplesPerSec=176.97566905965417, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:53,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=13980, skipped=269, lr=[1.11443522582207e-07, 1.11443522582207e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:53,587] [INFO] [timer.py:199:stop] epoch=15/micro_step=180/global_step=13980, RunningAvgSamplesPerSec=177.07652071380497, CurrSamplesPerSec=176.7511167298778, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:13:57,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=13990, skipped=269, lr=[1.0925381280847098e-07, 1.0925381280847098e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:13:57,222] [INFO] [timer.py:199:stop] epoch=15/micro_step=190/global_step=13990, RunningAvgSamplesPerSec=177.07591455278725, CurrSamplesPerSec=176.5474510546725, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:00,453] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:14:00,787] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:14:00,787] [INFO] [logging.py:96:log_dist] [Rank 0] step=14000, skipped=271, lr=[1.0751751011778247e-07, 1.0751751011778247e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:00,788] [INFO] [timer.py:199:stop] epoch=15/micro_step=200/global_step=14000, RunningAvgSamplesPerSec=177.07777206426564, CurrSamplesPerSec=192.10103580593776, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:04,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=14010, skipped=271, lr=[1.0536647146901515e-07, 1.0536647146901515e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:04,409] [INFO] [timer.py:199:stop] epoch=15/micro_step=210/global_step=14010, RunningAvgSamplesPerSec=177.07766104916897, CurrSamplesPerSec=176.8406595469814, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:08,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=14020, skipped=271, lr=[1.0323693053214024e-07, 1.0323693053214024e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:08,031] [INFO] [timer.py:199:stop] epoch=15/micro_step=220/global_step=14020, RunningAvgSamplesPerSec=177.07754780936426, CurrSamplesPerSec=176.79511994003985, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:11,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=14030, skipped=271, lr=[1.0112889700711685e-07, 1.0112889700711685e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:11,652] [INFO] [timer.py:199:stop] epoch=15/micro_step=230/global_step=14030, RunningAvgSamplesPerSec=177.07743743744138, CurrSamplesPerSec=177.1599779569236, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:15,254] [INFO] [logging.py:96:log_dist] [Rank 0] step=14040, skipped=271, lr=[9.904238049594058e-08, 9.904238049594058e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:15,272] [INFO] [timer.py:199:stop] epoch=15/micro_step=240/global_step=14040, RunningAvgSamplesPerSec=177.07739187788243, CurrSamplesPerSec=177.25602188069948, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:18,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=14050, skipped=271, lr=[9.697739050259745e-08, 9.697739050259745e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:18,893] [INFO] [timer.py:199:stop] epoch=15/micro_step=250/global_step=14050, RunningAvgSamplesPerSec=177.07729817181146, CurrSamplesPerSec=176.9452214621233, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=14060, skipped=271, lr=[9.493393643302004e-08, 9.493393643302004e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:22,512] [INFO] [timer.py:199:stop] epoch=15/micro_step=260/global_step=14060, RunningAvgSamplesPerSec=177.0772677910322, CurrSamplesPerSec=176.9270278998952, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:26,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=14070, skipped=271, lr=[9.291202759504828e-08, 9.291202759504828e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:26,132] [INFO] [timer.py:199:stop] epoch=15/micro_step=270/global_step=14070, RunningAvgSamplesPerSec=177.07722741470758, CurrSamplesPerSec=177.02678858182946, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:29,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=14080, skipped=271, lr=[9.091167319838243e-08, 9.091167319838243e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:29,779] [INFO] [timer.py:199:stop] epoch=15/micro_step=280/global_step=14080, RunningAvgSamplesPerSec=177.07624261921458, CurrSamplesPerSec=176.83763060968644, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:33,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=14090, skipped=271, lr=[8.89328823545444e-08, 8.89328823545444e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:33,400] [INFO] [timer.py:199:stop] epoch=15/micro_step=290/global_step=14090, RunningAvgSamplesPerSec=177.07612531251294, CurrSamplesPerSec=176.9452214621233, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 23:14:37,003] [INFO] [logging.py:96:log_dist] [Rank 0] step=14100, skipped=271, lr=[8.697566407683387e-08, 8.697566407683387e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:37,021] [INFO] [timer.py:199:stop] epoch=15/micro_step=300/global_step=14100, RunningAvgSamplesPerSec=177.07603418905634, CurrSamplesPerSec=176.74157395104825, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:37,354] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:14:37,687] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:14:40,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=14110, skipped=273, lr=[8.542542769648692e-08, 8.542542769648692e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:40,587] [INFO] [timer.py:199:stop] epoch=15/micro_step=310/global_step=14110, RunningAvgSamplesPerSec=177.07786449999136, CurrSamplesPerSec=176.80734694253928, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:44,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=14120, skipped=273, lr=[8.350706243764637e-08, 8.350706243764637e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:44,211] [INFO] [timer.py:199:stop] epoch=15/micro_step=320/global_step=14120, RunningAvgSamplesPerSec=177.07765110769006, CurrSamplesPerSec=176.02301898620462, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:47,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=14130, skipped=273, lr=[8.16102944592928e-08, 8.16102944592928e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:47,847] [INFO] [timer.py:199:stop] epoch=15/micro_step=330/global_step=14130, RunningAvgSamplesPerSec=177.07703645365484, CurrSamplesPerSec=177.13636312645298, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:51,448] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=14140, skipped=273, lr=[7.973513240111515e-08, 7.973513240111515e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:51,466] [INFO] [timer.py:199:stop] epoch=15/micro_step=340/global_step=14140, RunningAvgSamplesPerSec=177.07698318012154, CurrSamplesPerSec=176.79453774400713, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:55,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=14150, skipped=273, lr=[7.78815848043905e-08, 7.78815848043905e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:55,088] [INFO] [timer.py:199:stop] epoch=15/micro_step=350/global_step=14150, RunningAvgSamplesPerSec=177.07686919856937, CurrSamplesPerSec=177.00367609968131, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:14:58,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=14160, skipped=273, lr=[7.604966011194172e-08, 7.604966011194172e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:14:58,718] [INFO] [timer.py:199:stop] epoch=15/micro_step=360/global_step=14160, RunningAvgSamplesPerSec=177.07647139573203, CurrSamplesPerSec=176.97590241503278, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:02,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=14170, skipped=273, lr=[7.423936666810114e-08, 7.423936666810114e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:02,367] [INFO] [timer.py:199:stop] epoch=15/micro_step=370/global_step=14170, RunningAvgSamplesPerSec=177.075408289171, CurrSamplesPerSec=176.74401773527586, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:05,971] [INFO] [logging.py:96:log_dist] [Rank 0] step=14180, skipped=273, lr=[7.245071271867132e-08, 7.245071271867132e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:05,990] [INFO] [timer.py:199:stop] epoch=15/micro_step=380/global_step=14180, RunningAvgSamplesPerSec=177.07527818723136, CurrSamplesPerSec=176.98850451940189, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:09,592] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=14190, skipped=273, lr=[7.068370641088817e-08, 7.068370641088817e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:09,610] [INFO] [timer.py:199:stop] epoch=15/micro_step=390/global_step=14190, RunningAvgSamplesPerSec=177.07522387186407, CurrSamplesPerSec=177.08868503949995, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:13,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=14200, skipped=273, lr=[6.893835579338344e-08, 6.893835579338344e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:13,232] [INFO] [timer.py:199:stop] epoch=15/micro_step=400/global_step=14200, RunningAvgSamplesPerSec=177.07511210075012, CurrSamplesPerSec=176.7051579373517, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:14,288] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:15:14,622] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:15:16,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=14210, skipped=275, lr=[6.755767274208959e-08, 6.755767274208959e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:16,794] [INFO] [timer.py:199:stop] epoch=15/micro_step=410/global_step=14210, RunningAvgSamplesPerSec=177.07704889437602, CurrSamplesPerSec=177.07162999819917, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:20,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=14220, skipped=275, lr=[6.585132233475405e-08, 6.585132233475405e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:20,414] [INFO] [timer.py:199:stop] epoch=15/micro_step=420/global_step=14220, RunningAvgSamplesPerSec=177.07698588865367, CurrSamplesPerSec=176.9806863359156, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:24,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=14230, skipped=275, lr=[6.416664962898266e-08, 6.416664962898266e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:24,033] [INFO] [timer.py:199:stop] epoch=15/micro_step=430/global_step=14230, RunningAvgSamplesPerSec=177.0769587669053, CurrSamplesPerSec=177.0170992813422, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:27,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=14240, skipped=275, lr=[6.25036622983806e-08, 6.25036622983806e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:27,654] [INFO] [timer.py:199:stop] epoch=15/micro_step=440/global_step=14240, RunningAvgSamplesPerSec=177.0769009093941, CurrSamplesPerSec=177.0260881149014, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:31,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=14250, skipped=275, lr=[6.086236791777957e-08, 6.086236791777957e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:31,277] [INFO] [timer.py:199:stop] epoch=15/micro_step=450/global_step=14250, RunningAvgSamplesPerSec=177.0767271748675, CurrSamplesPerSec=177.02527091049137, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 23:15:34,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=14260, skipped=275, lr=[5.92427739631977e-08, 5.92427739631977e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:34,915] [INFO] [timer.py:199:stop] epoch=15/micro_step=460/global_step=14260, RunningAvgSamplesPerSec=177.07625235156866, CurrSamplesPerSec=176.64562159003492, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:38,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=14270, skipped=275, lr=[5.764488781181105e-08, 5.764488781181105e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:38,536] [INFO] [timer.py:199:stop] epoch=15/micro_step=470/global_step=14270, RunningAvgSamplesPerSec=177.07618480401547, CurrSamplesPerSec=176.95408634798923, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:42,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=14280, skipped=275, lr=[5.606871674191729e-08, 5.606871674191729e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:42,155] [INFO] [timer.py:199:stop] epoch=15/micro_step=480/global_step=14280, RunningAvgSamplesPerSec=177.07615539334427, CurrSamplesPerSec=176.93309200491973, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:45,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=275, lr=[5.451426793290241e-08, 5.451426793290241e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:45,774] [INFO] [timer.py:199:stop] epoch=15/micro_step=490/global_step=14290, RunningAvgSamplesPerSec=177.07614230923795, CurrSamplesPerSec=176.9199147680276, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:49,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=14300, skipped=275, lr=[5.298154846520809e-08, 5.298154846520809e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:49,401] [INFO] [timer.py:199:stop] epoch=15/micro_step=500/global_step=14300, RunningAvgSamplesPerSec=177.07585292089283, CurrSamplesPerSec=177.18757553735068, MemAllocated=4.98GB, 
MaxMemAllocated=24.08GB [2023-04-21 23:15:51,184] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:15:51,522] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:15:52,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=14310, skipped=277, lr=[5.177102271167558e-08, 5.177102271167558e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:52,972] [INFO] [timer.py:199:stop] epoch=15/micro_step=510/global_step=14310, RunningAvgSamplesPerSec=177.07749421053984, CurrSamplesPerSec=176.81130652488503, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:15:56,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=14320, skipped=277, lr=[5.0277433585122065e-08, 5.0277433585122065e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:15:56,593] [INFO] [timer.py:199:stop] epoch=15/micro_step=520/global_step=14320, RunningAvgSamplesPerSec=177.0774149025436, CurrSamplesPerSec=176.9533864562843, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:00,195] [INFO] [logging.py:96:log_dist] [Rank 0] step=14330, skipped=277, lr=[4.880559309847276e-08, 4.880559309847276e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:00,213] [INFO] [timer.py:199:stop] epoch=15/micro_step=530/global_step=14330, RunningAvgSamplesPerSec=177.07735846761307, CurrSamplesPerSec=176.86885694161398, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:03,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=14340, skipped=277, lr=[4.7355507955894074e-08, 4.7355507955894074e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:03,833] [INFO] [timer.py:199:stop] epoch=15/micro_step=540/global_step=14340, RunningAvgSamplesPerSec=177.0773045799781, CurrSamplesPerSec=176.8467177328663, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:07,455] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=14350, skipped=277, lr=[4.592718476245597e-08, 4.592718476245597e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:07,474] [INFO] [timer.py:199:stop] epoch=15/micro_step=550/global_step=14350, RunningAvgSamplesPerSec=177.07655115491693, CurrSamplesPerSec=177.16640882811058, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] step=14360, skipped=277, lr=[4.452063002410463e-08, 4.452063002410463e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:11,094] [INFO] [timer.py:199:stop] epoch=15/micro_step=560/global_step=14360, RunningAvgSamplesPerSec=177.07650361627714, CurrSamplesPerSec=176.9857038966656, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:14,697] [INFO] [logging.py:96:log_dist] [Rank 0] step=14370, skipped=277, lr=[4.3135850147631915e-08, 4.3135850147631915e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:14,715] [INFO] [timer.py:199:stop] epoch=15/micro_step=570/global_step=14370, RunningAvgSamplesPerSec=177.07641345371735, CurrSamplesPerSec=176.94498818764765, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:18,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=14380, skipped=277, lr=[4.1772851440644845e-08, 4.1772851440644845e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:18,335] [INFO] [timer.py:199:stop] epoch=15/micro_step=580/global_step=14380, RunningAvgSamplesPerSec=177.0763690992683, CurrSamplesPerSec=176.90394145806533, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:21,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=14390, skipped=277, lr=[4.043164011154094e-08, 4.043164011154094e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:21,954] [INFO] [timer.py:199:stop] epoch=15/micro_step=590/global_step=14390, RunningAvgSamplesPerSec=177.0763500634381, CurrSamplesPerSec=177.11941576788016, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:25,557] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=14400, skipped=277, lr=[3.911222226947448e-08, 3.911222226947448e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:25,576] [INFO] [timer.py:199:stop] epoch=15/micro_step=600/global_step=14400, RunningAvgSamplesPerSec=177.07626321356386, CurrSamplesPerSec=176.98838782501826, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:28,080] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:16:28,413] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:16:29,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=14410, skipped=279, lr=[3.807238334846818e-08, 3.807238334846818e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:29,138] [INFO] [timer.py:199:stop] epoch=15/micro_step=610/global_step=14410, RunningAvgSamplesPerSec=177.07818402258164, CurrSamplesPerSec=177.03461083910182, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:32,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=14420, skipped=279, lr=[3.679220886125956e-08, 3.679220886125956e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:32,759] [INFO] [timer.py:199:stop] epoch=15/micro_step=620/global_step=14420, RunningAvgSamplesPerSec=177.07810218489684, CurrSamplesPerSec=176.8579031666165, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:36,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=14430, skipped=279, lr=[3.553384443852918e-08, 3.553384443852918e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:36,377] [INFO] [timer.py:199:stop] epoch=15/micro_step=630/global_step=14430, RunningAvgSamplesPerSec=177.078119210609, CurrSamplesPerSec=177.07641909586854, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:39,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=14440, skipped=279, 
lr=[3.429729581206866e-08, 3.429729581206866e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:40,000] [INFO] [timer.py:199:stop] epoch=15/micro_step=640/global_step=14440, RunningAvgSamplesPerSec=177.07797440814437, CurrSamplesPerSec=176.95641936032507, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:43,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=14450, skipped=279, lr=[3.308256861429993e-08, 3.308256861429993e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:43,624] [INFO] [timer.py:199:stop] epoch=15/micro_step=650/global_step=14450, RunningAvgSamplesPerSec=177.0777837056253, CurrSamplesPerSec=177.0302909996208, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:47,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=14460, skipped=279, lr=[3.188966837824903e-08, 3.188966837824903e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:47,244] [INFO] [timer.py:199:stop] epoch=15/micro_step=660/global_step=14460, RunningAvgSamplesPerSec=177.0777112266401, CurrSamplesPerSec=177.05107486279366, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:50,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=14470, skipped=279, lr=[3.07186005375209e-08, 3.07186005375209e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:50,866] [INFO] [timer.py:199:stop] epoch=15/micro_step=670/global_step=14470, RunningAvgSamplesPerSec=177.07761150188196, CurrSamplesPerSec=176.84951395865932, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:54,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=14480, skipped=279, lr=[2.956937042627529e-08, 2.956937042627529e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:54,487] [INFO] [timer.py:199:stop] epoch=15/micro_step=680/global_step=14480, RunningAvgSamplesPerSec=177.0775180145798, CurrSamplesPerSec=176.69027008845853, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:16:58,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=14490, skipped=279, 
lr=[2.8441983279202135e-08, 2.8441983279202135e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:16:58,114] [INFO] [timer.py:199:stop] epoch=15/micro_step=690/global_step=14490, RunningAvgSamplesPerSec=177.07722713772836, CurrSamplesPerSec=176.92679467338777, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:01,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=14500, skipped=279, lr=[2.7336444231497413e-08, 2.7336444231497413e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:01,734] [INFO] [timer.py:199:stop] epoch=15/micro_step=700/global_step=14500, RunningAvgSamplesPerSec=177.07718277406846, CurrSamplesPerSec=176.95152010547133, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:04,964] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:17:05,297] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:17:05,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=14510, skipped=281, lr=[2.6467747012200608e-08, 2.6467747012200608e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:05,298] [INFO] [timer.py:199:stop] epoch=15/micro_step=710/global_step=14510, RunningAvgSamplesPerSec=177.07903439804696, CurrSamplesPerSec=191.9165859374609, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:08,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=14520, skipped=281, lr=[2.540154716638121e-08, 2.540154716638121e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:08,919] [INFO] [timer.py:199:stop] epoch=15/micro_step=720/global_step=14520, RunningAvgSamplesPerSec=177.0789591273217, CurrSamplesPerSec=176.94895393738523, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:12,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=14530, skipped=281, lr=[2.4357209268976138e-08, 2.4357209268976138e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:12,548] [INFO] [timer.py:199:stop] epoch=15/micro_step=730/global_step=14530, RunningAvgSamplesPerSec=177.07861461705417, CurrSamplesPerSec=176.9857038966656, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:16,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=14540, skipped=281, lr=[2.333473807689571e-08, 2.333473807689571e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:16,168] [INFO] [timer.py:199:stop] epoch=15/micro_step=740/global_step=14540, RunningAvgSamplesPerSec=177.0785383007123, CurrSamplesPerSec=177.0302909996208, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:19,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=14550, skipped=281, lr=[2.23341382474497e-08, 2.23341382474497e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:19,787] [INFO] [timer.py:199:stop] epoch=15/micro_step=750/global_step=14550, RunningAvgSamplesPerSec=177.07853608119368, CurrSamplesPerSec=177.0384638467804, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:23,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=14560, skipped=281, lr=[2.1355414338322718e-08, 2.1355414338322718e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:23,407] [INFO] [timer.py:199:stop] epoch=15/micro_step=760/global_step=14560, RunningAvgSamplesPerSec=177.07849987511995, CurrSamplesPerSec=176.9802196005417, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:27,007] [INFO] [logging.py:96:log_dist] [Rank 0] step=14570, skipped=281, lr=[2.0398570807558106e-08, 2.0398570807558106e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:27,025] [INFO] [timer.py:199:stop] epoch=15/micro_step=770/global_step=14570, RunningAvgSamplesPerSec=177.07850238385328, CurrSamplesPerSec=177.19494414211746, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:30,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=14580, skipped=281, lr=[1.946361201353225e-08, 1.946361201353225e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:30,644] [INFO] [timer.py:199:stop] epoch=15/micro_step=780/global_step=14580, RunningAvgSamplesPerSec=177.07849923198367, CurrSamplesPerSec=177.04290085126337, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:34,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=14590, skipped=281, lr=[1.8550542214940644e-08, 1.8550542214940644e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:34,268] [INFO] [timer.py:199:stop] epoch=15/micro_step=790/global_step=14590, RunningAvgSamplesPerSec=177.0783257654392, CurrSamplesPerSec=176.98231992911093, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:37,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=14600, skipped=281, lr=[1.765936557077271e-08, 1.765936557077271e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:37,890] [INFO] [timer.py:199:stop] epoch=15/micro_step=800/global_step=14600, RunningAvgSamplesPerSec=177.07823370087485, CurrSamplesPerSec=177.0548118051731, 
MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:41,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=14610, skipped=281, lr=[1.6790086140297347e-08, 1.6790086140297347e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:41,510] [INFO] [timer.py:199:stop] epoch=15/micro_step=810/global_step=14610, RunningAvgSamplesPerSec=177.07817969779853, CurrSamplesPerSec=176.94837072774578, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:41,843] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:17:42,176] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:17:45,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=14620, skipped=283, lr=[1.6110431253929868e-08, 1.6110431253929868e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:45,104] [INFO] [timer.py:199:stop] epoch=15/micro_step=820/global_step=14620, RunningAvgSamplesPerSec=177.0789829610796, CurrSamplesPerSec=176.77788856034613, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:48,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=14630, skipped=283, lr=[1.5280576719072884e-08, 1.5280576719072884e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:48,726] [INFO] [timer.py:199:stop] epoch=15/micro_step=830/global_step=14630, RunningAvgSamplesPerSec=177.0788964249475, CurrSamplesPerSec=177.03963145797678, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:52,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=14640, skipped=283, lr=[1.4472630233181501e-08, 1.4472630233181501e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:52,344] [INFO] [timer.py:199:stop] epoch=15/micro_step=840/global_step=14640, RunningAvgSamplesPerSec=177.0788992253621, CurrSamplesPerSec=177.13764892048732, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:17:55,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=14650, skipped=283, lr=[1.3686595476413991e-08, 1.3686595476413991e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:55,965] [INFO] [timer.py:199:stop] epoch=15/micro_step=850/global_step=14650, RunningAvgSamplesPerSec=177.0788259431593, CurrSamplesPerSec=177.09791482214038, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:17:59,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=14660, skipped=283, lr=[1.2922476029122392e-08, 1.2922476029122392e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:17:59,595] [INFO] [timer.py:199:stop] epoch=15/micro_step=860/global_step=14660, RunningAvgSamplesPerSec=177.07842477851003, CurrSamplesPerSec=175.82736086157126, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:18:03,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=14670, skipped=283, lr=[1.2180275371835887e-08, 1.2180275371835887e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:03,224] [INFO] [timer.py:199:stop] epoch=15/micro_step=870/global_step=14670, RunningAvgSamplesPerSec=177.07807667576407, CurrSamplesPerSec=177.04955677501056, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:18:06,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=14680, skipped=283, lr=[1.1459996885243141e-08, 1.1459996885243141e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:06,847] [INFO] [timer.py:199:stop] epoch=15/micro_step=880/global_step=14680, RunningAvgSamplesPerSec=177.07796046088916, CurrSamplesPerSec=177.13051884336645, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:18:10,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=14690, skipped=283, lr=[1.076164385017997e-08, 1.076164385017997e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:10,467] [INFO] [timer.py:199:stop] epoch=15/micro_step=890/global_step=14690, RunningAvgSamplesPerSec=177.0779051010191, CurrSamplesPerSec=177.16383642361114, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 
23:18:14,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=14700, skipped=283, lr=[1.0085219447612204e-08, 1.0085219447612204e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:14,088] [INFO] [timer.py:199:stop] epoch=15/micro_step=900/global_step=14700, RunningAvgSamplesPerSec=177.07781743047008, CurrSamplesPerSec=177.16114717150484, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:18:17,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=14710, skipped=283, lr=[9.430726758620686e-09, 9.430726758620686e-09], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:17,726] [INFO] [timer.py:199:stop] epoch=15/micro_step=910/global_step=14710, RunningAvgSamplesPerSec=177.07735059000308, CurrSamplesPerSec=176.7327303443726, MemAllocated=4.98GB, MaxMemAllocated=24.08GB [2023-04-21 23:18:18,784] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-21 23:18:19,118] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-04-21 23:18:21,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=14720, skipped=285, lr=[8.92292544791556e-09, 8.92292544791556e-09], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-21 23:18:21,291] [INFO] [timer.py:199:stop] epoch=15/micro_step=920/global_step=14720, RunningAvgSamplesPerSec=177.0791244119088, CurrSamplesPerSec=176.9038248751655, MemAllocated=4.98GB, MaxMemAllocated=24.08GB ***** Evaluating perplexity, Epoch 16/16 ***** ppl: 1.7787002325057983 saving the final model ... [2023-04-21 23:18:34,943] [INFO] [launch.py:460:main] Process 2763039 exits successfully. [2023-04-21 23:18:34,943] [INFO] [launch.py:460:main] Process 2763034 exits successfully. [2023-04-21 23:18:35,945] [INFO] [launch.py:460:main] Process 2763037 exits successfully. [2023-04-21 23:18:35,945] [INFO] [launch.py:460:main] Process 2763036 exits successfully. 
[2023-04-21 23:18:35,945] [INFO] [launch.py:460:main] Process 2763040 exits successfully.
[2023-04-21 23:18:35,945] [INFO] [launch.py:460:main] Process 2763038 exits successfully.
[2023-04-21 23:18:35,945] [INFO] [launch.py:460:main] Process 2763033 exits successfully.
[2023-04-21 23:18:37,947] [INFO] [launch.py:460:main] Process 2763032 exits successfully.