[2023-04-19 16:55:10,332] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-04-19 16:55:10,380] [INFO] [runner.py:540:main] cmd = /home/ubuntu/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path iamketan25/alpaca-instructions-dataset iamketan25/dolly-instructions-15k iamketan25/gsm-general-qa-instructions --model_name_or_path iamketan25/gpt-neo-1.3b-sft --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-3 --weight_decay 0.1 --num_train_epochs 1 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 3 --lora_dim 16 --lora_module_name h. --only_optimize_lora --deepspeed --output_dir ./gpt_neo_1.3b_sft_lora_dim_32_zero_stage3_epoch2 [2023-04-19 16:55:12,124] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2023-04-19 16:55:12,124] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0 [2023-04-19 16:55:12,124] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2023-04-19 16:55:12,124] [INFO] [launch.py:247:main] dist_world_size=4 [2023-04-19 16:55:12,124] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 [2023-04-19 16:55:15,363] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2023-04-19 16:55:20,249] [INFO] [partition_parameters.py:436:__exit__] finished initializing model with 1.42B parameters Some weights of GPTNeoForCausalLM were not initialized from the model checkpoint at iamketan25/gpt-neo-1.3b-sft and are newly initialized: ['transformer.h.10.attn.attention.masked_bias', 'transformer.h.9.attn.attention.masked_bias', 'transformer.h.17.attn.attention.masked_bias', 'transformer.h.1.attn.attention.masked_bias', 
'transformer.h.12.attn.attention.masked_bias', 'transformer.h.16.attn.attention.masked_bias', 'transformer.h.11.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.4.attn.attention.masked_bias', 'transformer.h.20.attn.attention.masked_bias', 'transformer.h.14.attn.attention.masked_bias', 'transformer.h.13.attn.attention.masked_bias', 'transformer.h.8.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.3.attn.attention.masked_bias', 'transformer.h.23.attn.attention.masked_bias', 'transformer.h.18.attn.attention.masked_bias', 'transformer.h.2.attn.attention.masked_bias', 'transformer.h.0.attn.attention.masked_bias', 'transformer.h.5.attn.attention.masked_bias', 'transformer.h.22.attn.attention.masked_bias', 'transformer.h.19.attn.attention.masked_bias', 'transformer.h.21.attn.attention.masked_bias', 'transformer.h.15.attn.attention.masked_bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
Some weights of GPTNeoForCausalLM were not initialized from the model checkpoint at iamketan25/gpt-neo-1.3b-sft and are newly initialized: ['transformer.h.20.attn.attention.masked_bias', 'transformer.h.12.attn.attention.masked_bias', 'transformer.h.15.attn.attention.masked_bias', 'transformer.h.5.attn.attention.masked_bias', 'transformer.h.3.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.16.attn.attention.masked_bias', 'transformer.h.4.attn.attention.masked_bias', 'transformer.h.9.attn.attention.masked_bias', 'transformer.h.11.attn.attention.masked_bias', 'transformer.h.19.attn.attention.masked_bias', 'transformer.h.2.attn.attention.masked_bias', 'transformer.h.14.attn.attention.masked_bias', 'transformer.h.8.attn.attention.masked_bias', 'transformer.h.18.attn.attention.masked_bias', 'transformer.h.22.attn.attention.masked_bias', 'transformer.h.0.attn.attention.masked_bias', 'transformer.h.21.attn.attention.masked_bias', 'transformer.h.1.attn.attention.masked_bias', 'transformer.h.17.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.23.attn.attention.masked_bias', 'transformer.h.10.attn.attention.masked_bias', 'transformer.h.13.attn.attention.masked_bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
Some weights of GPTNeoForCausalLM were not initialized from the model checkpoint at iamketan25/gpt-neo-1.3b-sft and are newly initialized: ['transformer.h.13.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.0.attn.attention.masked_bias', 'transformer.h.2.attn.attention.masked_bias', 'transformer.h.10.attn.attention.masked_bias', 'transformer.h.4.attn.attention.masked_bias', 'transformer.h.23.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.15.attn.attention.masked_bias', 'transformer.h.16.attn.attention.masked_bias', 'transformer.h.18.attn.attention.masked_bias', 'transformer.h.5.attn.attention.masked_bias', 'transformer.h.3.attn.attention.masked_bias', 'transformer.h.1.attn.attention.masked_bias', 'transformer.h.11.attn.attention.masked_bias', 'transformer.h.9.attn.attention.masked_bias', 'transformer.h.19.attn.attention.masked_bias', 'transformer.h.8.attn.attention.masked_bias', 'transformer.h.17.attn.attention.masked_bias', 'transformer.h.20.attn.attention.masked_bias', 'transformer.h.21.attn.attention.masked_bias', 'transformer.h.22.attn.attention.masked_bias', 'transformer.h.12.attn.attention.masked_bias', 'transformer.h.14.attn.attention.masked_bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
Some weights of GPTNeoForCausalLM were not initialized from the model checkpoint at iamketan25/gpt-neo-1.3b-sft and are newly initialized: ['transformer.h.18.attn.attention.masked_bias', 'transformer.h.10.attn.attention.masked_bias', 'transformer.h.13.attn.attention.masked_bias', 'transformer.h.4.attn.attention.masked_bias', 'transformer.h.20.attn.attention.masked_bias', 'transformer.h.21.attn.attention.masked_bias', 'transformer.h.15.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.2.attn.attention.masked_bias', 'transformer.h.14.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.1.attn.attention.masked_bias', 'transformer.h.9.attn.attention.masked_bias', 'transformer.h.11.attn.attention.masked_bias', 'transformer.h.12.attn.attention.masked_bias', 'transformer.h.5.attn.attention.masked_bias', 'transformer.h.17.attn.attention.masked_bias', 'transformer.h.16.attn.attention.masked_bias', 'transformer.h.22.attn.attention.masked_bias', 'transformer.h.0.attn.attention.masked_bias', 'transformer.h.8.attn.attention.masked_bias', 'transformer.h.19.attn.attention.masked_bias', 'transformer.h.23.attn.attention.masked_bias', 'transformer.h.3.attn.attention.masked_bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--alpaca-instructions-dataset-57eb880093a82a29/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec) 0%| | 0/2 [00:00 [2023-04-19 16:57:05,894] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... 
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py39_cu117/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [2023-04-19 16:57:06,328] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning [2023-04-19 16:57:06,329] [INFO] [utils.py:786:see_memory_usage] MA 0.78 GB Max_MA 1.28 GB CA 3.41 GB Max_CA 3 GB [2023-04-19 16:57:06,329] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.22 GB, percent = 18.9% [2023-04-19 16:57:06,331] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000 [2023-04-19 16:57:06,331] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000 Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.5918192863464355 seconds Loading extension module utils... Loading extension module utils... Time to load utils op: 0.6023216247558594 seconds Time to load utils op: 0.6022861003875732 seconds Loading extension module utils... Time to load utils op: 0.20152544975280762 seconds [2023-04-19 16:57:06,875] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-04-19 16:57:06,876] [INFO] [utils.py:786:see_memory_usage] MA 0.78 GB Max_MA 0.78 GB CA 3.41 GB Max_CA 3 GB [2023-04-19 16:57:06,876] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% Parameter Offload: Total persistent parameters: 495616 in 170 params [2023-04-19 16:57:07,289] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-04-19 16:57:07,289] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.78 GB CA 3.41 GB Max_CA 3 GB [2023-04-19 16:57:07,290] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% [2023-04-19 16:57:07,630] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions [2023-04-19 16:57:07,631] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 3.41 GB Max_CA 3 GB [2023-04-19 16:57:07,631] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% [2023-04-19 16:57:08,444] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 1 [2023-04-19 16:57:08,444] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 1.67 GB Max_CA 3 GB [2023-04-19 16:57:08,444] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:08,786] [INFO] 
[utils.py:785:see_memory_usage] Before creating fp32 partitions [2023-04-19 16:57:08,786] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 1.67 GB Max_CA 2 GB [2023-04-19 16:57:08,787] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:09,128] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions [2023-04-19 16:57:09,128] [INFO] [utils.py:786:see_memory_usage] MA 0.77 GB Max_MA 0.78 GB CA 1.67 GB Max_CA 2 GB [2023-04-19 16:57:09,129] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:09,469] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states [2023-04-19 16:57:09,470] [INFO] [utils.py:786:see_memory_usage] MA 0.77 GB Max_MA 0.77 GB CA 1.67 GB Max_CA 2 GB [2023-04-19 16:57:09,470] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:09,810] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states [2023-04-19 16:57:09,811] [INFO] [utils.py:786:see_memory_usage] MA 0.8 GB Max_MA 0.81 GB CA 1.67 GB Max_CA 2 GB [2023-04-19 16:57:09,811] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:09,812] [INFO] [stage3.py:366:_setup_for_real_optimizer] optimizer state initialized Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.00035953521728515625 seconds Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.00038123130798339844 seconds Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... 
No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0003180503845214844 seconds [2023-04-19 16:57:10,317] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer [2023-04-19 16:57:10,317] [INFO] [utils.py:786:see_memory_usage] MA 1.74 GB Max_MA 1.74 GB CA 2.61 GB Max_CA 3 GB [2023-04-19 16:57:10,318] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% [2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.95)] [2023-04-19 16:57:10,319] [INFO] [config.py:953:print] DeepSpeedEngine configuration: [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] amp_enabled .................. False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] amp_params ................... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] bfloat16_enabled ............. False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] comms_config ................. [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] communication_data_type ...... None [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] curriculum_params_legacy ..... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] data_efficiency_enabled ...... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dataloader_drop_last ......... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] disable_allgather ............ False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dump_state ................... 
False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1} [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_enabled ........... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1 [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0 [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100 [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06 [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01 [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_verbose ........... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] elasticity_enabled ........... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] fp16_auto_cast ............... False [2023-04-19 16:57:10,320] [INFO] [config.py:957:print] fp16_enabled ................. True [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] global_rank .................. 0 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] grad_accum_dtype ............. None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_accumulation_steps .. 2 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_clipping ............ 1.0 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_predivide_factor .... 
1.0 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] load_universal_checkpoint .... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] loss_scale ................... 0 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] memory_breakdown ............. False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_name ............... None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_params ............. None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pld_enabled .................. False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pld_params ................... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] prescale_gradients ........... 
False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] scheduler_name ............... None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] scheduler_params ............. None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] sparse_attention ............. None [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] steps_per_print .............. 10 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] train_batch_size ............. 32 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 4 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] use_node_local_storage ....... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] wall_clock_breakdown ......... False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] world_size ................... 4 [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_allow_untested_optimizer False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_enabled ................. True [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True [2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_optimization_stage ...... 
3 [2023-04-19 16:57:10,321] [INFO] [config.py:943:print_user_config] json = { "train_batch_size": 32, "train_micro_batch_size_per_gpu": 4, "steps_per_print": 10, "zero_optimization": { "stage": 3, "offload_param": { "device": "none" }, "offload_optimizer": { "device": "none" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07, "memory_efficient_linear": false }, "fp16": { "enabled": true, "loss_scale_window": 100 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "hybrid_engine": { "enabled": false, "inference_tp_size": 1, "release_inference_cache": false, "pin_parameters": true, "tp_gather_partition_size": 8 } } Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0003190040588378906 seconds ***** Running training ***** ***** Evaluating perplexity, Epoch 0/1 ***** ppl: 1.932552456855774 Beginning of Epoch 1/1, Total Micro Batches 5116 Invalidate trace cache @ step 0: expected module 16, but got module 0 [2023-04-19 17:04:45,205] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time [2023-04-19 17:05:06,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[0.000999962292024615], mom=[(0.9, 0.95)] [2023-04-19 17:05:06,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=10, RunningAvgSamplesPerSec=12.07828969080794, CurrSamplesPerSec=12.495106931233762, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:05:32,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[0.0009998491737860256], mom=[(0.9, 0.95)] [2023-04-19 17:05:32,169] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=20, RunningAvgSamplesPerSec=12.27707734265837, CurrSamplesPerSec=12.316409530577367, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:05:58,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[0.0009996606623460709], mom=[(0.9, 0.95)] [2023-04-19 17:05:58,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=30, RunningAvgSamplesPerSec=12.213891713093837, CurrSamplesPerSec=12.335717709296935, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:06:25,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[0.0009993967861382895], mom=[(0.9, 0.95)] [2023-04-19 17:06:25,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=40, RunningAvgSamplesPerSec=12.171658146585271, CurrSamplesPerSec=12.415960783326993, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:06:50,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[0.0009990575849636322], mom=[(0.9, 0.95)] [2023-04-19 17:06:50,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=50, RunningAvgSamplesPerSec=12.227460835941864, CurrSamplesPerSec=12.498281061505862, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:07:17,048] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=60, skipped=0, lr=[0.0009986431099844567], mom=[(0.9, 0.95)] [2023-04-19 17:07:17,049] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=60, RunningAvgSamplesPerSec=12.235269962686647, CurrSamplesPerSec=12.427871199814515, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:07:42,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[0.0009981534237168124], mom=[(0.9, 0.95)] [2023-04-19 17:07:42,760] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=70, RunningAvgSamplesPerSec=12.26760719734709, CurrSamplesPerSec=12.447521475862185, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:08:09,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[0.0009975886000210103], mom=[(0.9, 0.95)] [2023-04-19 17:08:09,216] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=80, RunningAvgSamplesPerSec=12.246875617612654, CurrSamplesPerSec=12.463447132852547, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:08:34,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[0.0009969487240904821], mom=[(0.9, 0.95)] [2023-04-19 17:08:34,878] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=90, RunningAvgSamplesPerSec=12.273182191155067, CurrSamplesPerSec=12.479842927239387, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:09:01,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[0.0009962338924389318], mom=[(0.9, 0.95)] [2023-04-19 17:09:01,202] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=100, RunningAvgSamplesPerSec=12.262435671957197, CurrSamplesPerSec=12.444532300839375, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:09:03,712] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:09:06,241] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! 
Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:09:27,458] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=2, lr=[0.0009956081310737383], mom=[(0.9, 0.95)] [2023-04-19 17:09:27,458] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=110, RunningAvgSamplesPerSec=12.256601643093727, CurrSamplesPerSec=10.073982804736758, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:09:53,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=2, lr=[0.0009947586584163801], mom=[(0.9, 0.95)] [2023-04-19 17:09:53,232] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=120, RunningAvgSamplesPerSec=12.270975908562303, CurrSamplesPerSec=12.399814784909182, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:10:19,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=2, lr=[0.0009938345603697695], mom=[(0.9, 0.95)] [2023-04-19 17:10:19,854] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=130, RunningAvgSamplesPerSec=12.25194301856043, CurrSamplesPerSec=12.366203557310797, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:10:45,602] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=2, lr=[0.0009928359763173725], mom=[(0.9, 0.95)] [2023-04-19 17:10:45,603] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=140, RunningAvgSamplesPerSec=12.26540750325886, CurrSamplesPerSec=12.465680069991771, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:11:12,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=2, lr=[0.0009917630568775197], mom=[(0.9, 0.95)] [2023-04-19 17:11:12,055] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=150, RunningAvgSamplesPerSec=12.254730873875417, CurrSamplesPerSec=12.451094211070863, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:11:37,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=2, lr=[0.0009906159638806912], mom=[(0.9, 0.95)] [2023-04-19 17:11:37,805] 
[INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=160, RunningAvgSamplesPerSec=12.266273586942694, CurrSamplesPerSec=12.372864453261226, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:12:04,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=2, lr=[0.0009893948703451048], mom=[(0.9, 0.95)] [2023-04-19 17:12:04,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=170, RunningAvgSamplesPerSec=12.255316195957107, CurrSamplesPerSec=12.35051278684349, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:12:30,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=2, lr=[0.00098809996045062], mom=[(0.9, 0.95)] [2023-04-19 17:12:30,058] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=180, RunningAvgSamplesPerSec=12.265586794678816, CurrSamplesPerSec=12.53894458266291, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:12:55,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=2, lr=[0.0009867314295109592], mom=[(0.9, 0.95)] [2023-04-19 17:12:55,974] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=190, RunningAvgSamplesPerSec=12.270574293164143, CurrSamplesPerSec=12.805881607610917, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:13:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=2, lr=[0.0009852894839442454], mom=[(0.9, 0.95)] [2023-04-19 17:13:21,597] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=200, RunningAvgSamplesPerSec=12.28206325598975, CurrSamplesPerSec=10.304444444615054, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:13:29,036] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:13:31,505] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:13:46,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=4, lr=[0.0009840832147423797], mom=[(0.9, 0.95)] [2023-04-19 17:13:46,471] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=210, RunningAvgSamplesPerSec=12.309450362425165, CurrSamplesPerSec=12.82604532079847, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:14:12,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=4, lr=[0.0009825096783456148], mom=[(0.9, 0.95)] [2023-04-19 17:14:12,095] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=220, RunningAvgSamplesPerSec=12.318099158821909, CurrSamplesPerSec=12.881070542377875, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:14:36,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=4, lr=[0.000980863364096554], mom=[(0.9, 0.95)] [2023-04-19 17:14:36,999] [INFO] [timer.py:199:stop] epoch=0/micro_step=460/global_step=230, RunningAvgSamplesPerSec=12.341027356371558, CurrSamplesPerSec=12.877603892322144, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:15:02,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=4, lr=[0.0009791445203119053], mom=[(0.9, 0.95)] [2023-04-19 17:15:02,518] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=240, RunningAvgSamplesPerSec=12.349763302828249, CurrSamplesPerSec=12.845641906787987, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:15:27,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=4, lr=[0.0009773534062481454], mom=[(0.9, 0.95)] [2023-04-19 17:15:27,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=250, RunningAvgSamplesPerSec=12.370237821031711, CurrSamplesPerSec=12.822156205013146, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:15:52,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=4, lr=[0.0009754902920624147], mom=[(0.9, 0.95)] [2023-04-19 17:15:52,996] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=520/global_step=260, RunningAvgSamplesPerSec=12.375631075615024, CurrSamplesPerSec=12.870813186986009, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:16:17,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=4, lr=[0.0009735554587717682], mom=[(0.9, 0.95)] [2023-04-19 17:16:17,857] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=270, RunningAvgSamplesPerSec=12.393919730920452, CurrSamplesPerSec=12.862547808911936, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:16:43,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=4, lr=[0.0009715491982107905], mom=[(0.9, 0.95)] [2023-04-19 17:16:43,461] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=280, RunningAvgSamplesPerSec=12.398073557379359, CurrSamplesPerSec=12.863134582943513, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:17:08,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=4, lr=[0.0009694718129875771], mom=[(0.9, 0.95)] [2023-04-19 17:17:08,936] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=290, RunningAvgSamplesPerSec=12.404108615541837, CurrSamplesPerSec=12.860413192082941, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:17:33,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=4, lr=[0.0009673236164380912], mom=[(0.9, 0.95)] [2023-04-19 17:17:33,853] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=300, RunningAvgSamplesPerSec=12.418762449265945, CurrSamplesPerSec=12.851047499700309, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:17:46,275] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:17:48,741] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:17:59,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=6, lr=[0.0009655542924250932], mom=[(0.9, 0.95)] [2023-04-19 17:17:59,496] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=310, RunningAvgSamplesPerSec=12.421115391503212, CurrSamplesPerSec=12.816023451561925, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:18:24,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=6, lr=[0.0009632794591562836], mom=[(0.9, 0.95)] [2023-04-19 17:18:24,478] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=320, RunningAvgSamplesPerSec=12.433364398393774, CurrSamplesPerSec=12.829373894544624, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:18:50,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=6, lr=[0.000960934748565705], mom=[(0.9, 0.95)] [2023-04-19 17:18:50,281] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=330, RunningAvgSamplesPerSec=12.4327725639688, CurrSamplesPerSec=12.797726234057041, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:19:15,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=6, lr=[0.0009585205143105142], mom=[(0.9, 0.95)] [2023-04-19 17:19:15,444] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=340, RunningAvgSamplesPerSec=12.441374838659565, CurrSamplesPerSec=12.701174220816803, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:19:41,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=6, lr=[0.0009560371205342551], mom=[(0.9, 0.95)] [2023-04-19 17:19:41,502] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=350, RunningAvgSamplesPerSec=12.437047446225632, CurrSamplesPerSec=12.582267551983822, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:20:07,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=6, lr=[0.0009534849418119328], mom=[(0.9, 0.95)] [2023-04-19 17:20:07,019] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=720/global_step=360, RunningAvgSamplesPerSec=12.440282802443646, CurrSamplesPerSec=12.476507663951448, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:20:33,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=6, lr=[0.0009508643630935172], mom=[(0.9, 0.95)] [2023-04-19 17:20:33,388] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=370, RunningAvgSamplesPerSec=12.43213407083404, CurrSamplesPerSec=12.501919098490374, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:20:59,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=6, lr=[0.0009481757796458796], mom=[(0.9, 0.95)] [2023-04-19 17:20:59,805] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=380, RunningAvgSamplesPerSec=12.423823090381852, CurrSamplesPerSec=12.381460225246393, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:21:25,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=6, lr=[0.0009454195969931738], mom=[(0.9, 0.95)] [2023-04-19 17:21:25,673] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=390, RunningAvgSamplesPerSec=12.422769578096686, CurrSamplesPerSec=12.376197022211878, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:21:52,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=6, lr=[0.0009425962308556705], mom=[(0.9, 0.95)] [2023-04-19 17:21:52,290] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=400, RunningAvgSamplesPerSec=12.412703792380244, CurrSamplesPerSec=12.393729084304733, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:22:10,308] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:22:12,846] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:22:18,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=8, lr=[0.0009402894516714383], mom=[(0.9, 0.95)] [2023-04-19 17:22:18,036] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=410, RunningAvgSamplesPerSec=12.41341628187407, CurrSamplesPerSec=12.390996757255785, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:22:44,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=8, lr=[0.0009373462351812672], mom=[(0.9, 0.95)] [2023-04-19 17:22:44,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=420, RunningAvgSamplesPerSec=12.405510418676558, CurrSamplesPerSec=12.443687744002439, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:23:10,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=8, lr=[0.0009343370529268123], mom=[(0.9, 0.95)] [2023-04-19 17:23:10,274] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=430, RunningAvgSamplesPerSec=12.406355781250136, CurrSamplesPerSec=12.453858887280555, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:23:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=8, lr=[0.000931262358788755], mom=[(0.9, 0.95)] [2023-04-19 17:23:36,829] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=440, RunningAvgSamplesPerSec=12.398296176552364, CurrSamplesPerSec=12.41375136191017, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:24:03,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=8, lr=[0.000928122616529059], mom=[(0.9, 0.95)] [2023-04-19 17:24:03,064] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=450, RunningAvgSamplesPerSec=12.394027658246998, CurrSamplesPerSec=10.898765305383344, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:24:29,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=8, lr=[0.0009249182997210198], mom=[(0.9, 0.95)] [2023-04-19 17:24:29,250] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=920/global_step=460, RunningAvgSamplesPerSec=12.390452563746633, CurrSamplesPerSec=12.425945131237496, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:24:55,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=8, lr=[0.0009216498916778344], mom=[(0.9, 0.95)] [2023-04-19 17:24:55,623] [INFO] [timer.py:199:stop] epoch=0/micro_step=940/global_step=470, RunningAvgSamplesPerSec=12.38512612561435, CurrSamplesPerSec=12.463959862063918, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:25:21,314] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=8, lr=[0.0009183178853797029], mom=[(0.9, 0.95)] [2023-04-19 17:25:21,314] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=480, RunningAvgSamplesPerSec=12.386858313326691, CurrSamplesPerSec=12.438629725788537, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:25:47,838] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=8, lr=[0.0009149227833994717], mom=[(0.9, 0.95)] [2023-04-19 17:25:47,838] [INFO] [timer.py:199:stop] epoch=0/micro_step=980/global_step=490, RunningAvgSamplesPerSec=12.38034147155676, CurrSamplesPerSec=12.433708277773485, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:26:13,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=8, lr=[0.000911465097826828], mom=[(0.9, 0.95)] [2023-04-19 17:26:13,690] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=500, RunningAvgSamplesPerSec=12.380554445613633, CurrSamplesPerSec=12.456449068622906, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:26:37,501] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:26:40,028] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:26:40,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=10, lr=[0.0009086542393346895], mom=[(0.9, 0.95)] [2023-04-19 17:26:40,029] [INFO] [timer.py:199:stop] epoch=0/micro_step=1020/global_step=510, RunningAvgSamplesPerSec=12.376163880112658, CurrSamplesPerSec=12.67544033782103, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:27:05,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=10, lr=[0.0009050852238427441], mom=[(0.9, 0.95)] [2023-04-19 17:27:05,741] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=520, RunningAvgSamplesPerSec=12.377737660987972, CurrSamplesPerSec=12.420433708306238, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:27:32,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=10, lr=[0.0009014551085762004], mom=[(0.9, 0.95)] [2023-04-19 17:27:32,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=1060/global_step=530, RunningAvgSamplesPerSec=12.372661889023702, CurrSamplesPerSec=12.377309799385882, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:27:58,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=10, lr=[0.0008977644410722474], mom=[(0.9, 0.95)] [2023-04-19 17:27:58,848] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=540, RunningAvgSamplesPerSec=12.3657484083089, CurrSamplesPerSec=10.83456878461236, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:28:24,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=10, lr=[0.0008940137780012825], mom=[(0.9, 0.95)] [2023-04-19 17:28:24,615] [INFO] [timer.py:199:stop] epoch=0/micro_step=1100/global_step=550, RunningAvgSamplesPerSec=12.366938514326245, CurrSamplesPerSec=12.479740812725497, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:28:50,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=10, lr=[0.0008902036850829485], mom=[(0.9, 0.95)] [2023-04-19 17:28:50,954] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=1120/global_step=560, RunningAvgSamplesPerSec=12.363198818138457, CurrSamplesPerSec=12.443068244459347, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:29:16,787] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=10, lr=[0.0008863347370008057], mom=[(0.9, 0.95)] [2023-04-19 17:29:16,787] [INFO] [timer.py:199:stop] epoch=0/micro_step=1140/global_step=570, RunningAvgSamplesPerSec=12.3638415350952, CurrSamplesPerSec=12.362678227713483, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:29:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=10, lr=[0.0008824075173156499], mom=[(0.9, 0.95)] [2023-04-19 17:29:43,336] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=580, RunningAvgSamplesPerSec=12.35855420493839, CurrSamplesPerSec=12.404735802011816, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:30:09,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=10, lr=[0.0008784226183774943], mom=[(0.9, 0.95)] [2023-04-19 17:30:09,222] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=590, RunningAvgSamplesPerSec=12.358818701228529, CurrSamplesPerSec=12.365673775252885, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:30:35,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=10, lr=[0.000874380641236223], mom=[(0.9, 0.95)] [2023-04-19 17:30:35,854] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=600, RunningAvgSamplesPerSec=12.353132909214159, CurrSamplesPerSec=12.35099239814929, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:31:01,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=10, lr=[0.0008702821955509344], mom=[(0.9, 0.95)] [2023-04-19 17:31:01,682] [INFO] [timer.py:199:stop] epoch=0/micro_step=1220/global_step=610, RunningAvgSamplesPerSec=12.353940832433471, CurrSamplesPerSec=12.39328276766209, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:31:04,560] [INFO] 
[loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:31:07,486] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:31:28,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=12, lr=[0.0008669631967817167], mom=[(0.9, 0.95)] [2023-04-19 17:31:28,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=620, RunningAvgSamplesPerSec=12.349334304280147, CurrSamplesPerSec=12.38552657235039, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:31:54,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=12, lr=[0.0008627646711857188], mom=[(0.9, 0.95)] [2023-04-19 17:31:54,782] [INFO] [timer.py:199:stop] epoch=0/micro_step=1260/global_step=630, RunningAvgSamplesPerSec=12.344519090997533, CurrSamplesPerSec=10.799963854672287, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:32:20,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=12, lr=[0.0008585114291045544], mom=[(0.9, 0.95)] [2023-04-19 17:32:20,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=640, RunningAvgSamplesPerSec=12.344965329862202, CurrSamplesPerSec=12.400174503397956, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:32:47,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=12, lr=[0.0008542041120628143], mom=[(0.9, 0.95)] [2023-04-19 17:32:47,133] [INFO] [timer.py:199:stop] epoch=0/micro_step=1300/global_step=650, RunningAvgSamplesPerSec=12.341185303447821, CurrSamplesPerSec=12.410061177588837, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:33:13,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=12, lr=[0.0008498433697413186], mom=[(0.9, 0.95)] [2023-04-19 17:33:13,001] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=660, 
RunningAvgSamplesPerSec=12.341826339333801, CurrSamplesPerSec=12.325113758079137, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:33:39,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=12, lr=[0.0008454298598791235], mom=[(0.9, 0.95)] [2023-04-19 17:33:39,579] [INFO] [timer.py:199:stop] epoch=0/micro_step=1340/global_step=670, RunningAvgSamplesPerSec=12.337384792748615, CurrSamplesPerSec=12.432642918761928, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:34:05,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=12, lr=[0.000840964248174314], mom=[(0.9, 0.95)] [2023-04-19 17:34:05,426] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=680, RunningAvgSamplesPerSec=12.338203936574846, CurrSamplesPerSec=12.39753896095559, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:34:32,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=12, lr=[0.0008364472081835954], mom=[(0.9, 0.95)] [2023-04-19 17:34:32,004] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=690, RunningAvgSamplesPerSec=12.333943151295635, CurrSamplesPerSec=12.388178734300265, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:34:58,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=12, lr=[0.0008318794212206986], mom=[(0.9, 0.95)] [2023-04-19 17:34:58,288] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=700, RunningAvgSamplesPerSec=12.331818651748383, CurrSamplesPerSec=10.815059116797121, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:35:24,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=12, lr=[0.0008272615762536171], mom=[(0.9, 0.95)] [2023-04-19 17:35:24,532] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=710, RunningAvgSamplesPerSec=12.330020191929695, CurrSamplesPerSec=12.390247522970165, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:35:32,232] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! 
Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:35:34,771] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:35:50,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=14, lr=[0.0008235317263262469], mom=[(0.9, 0.95)] [2023-04-19 17:35:50,892] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=720, RunningAvgSamplesPerSec=12.327496611969542, CurrSamplesPerSec=12.364812549050821, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:36:16,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=14, lr=[0.0008188255371846346], mom=[(0.9, 0.95)] [2023-04-19 17:36:16,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=1460/global_step=730, RunningAvgSamplesPerSec=12.328485193644248, CurrSamplesPerSec=12.39404266958273, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:36:43,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=14, lr=[0.0008140712589809891], mom=[(0.9, 0.95)] [2023-04-19 17:36:43,182] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=740, RunningAvgSamplesPerSec=12.32543440337834, CurrSamplesPerSec=12.378232129443344, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:37:09,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=14, lr=[0.0008092696088121323], mom=[(0.9, 0.95)] [2023-04-19 17:37:09,039] [INFO] [timer.py:199:stop] epoch=0/micro_step=1500/global_step=750, RunningAvgSamplesPerSec=12.326270629881153, CurrSamplesPerSec=12.410741659317063, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:37:35,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=14, lr=[0.0008044213109200901], mom=[(0.9, 0.95)] [2023-04-19 17:37:35,655] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=760, RunningAvgSamplesPerSec=12.322338561705967, CurrSamplesPerSec=12.403199713747565, 
MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:38:01,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=14, lr=[0.0007995270965828522], mom=[(0.9, 0.95)] [2023-04-19 17:38:01,486] [INFO] [timer.py:199:stop] epoch=0/micro_step=1540/global_step=770, RunningAvgSamplesPerSec=12.32335401862153, CurrSamplesPerSec=12.418311166214071, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:38:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=14, lr=[0.0007945877040040741], mom=[(0.9, 0.95)] [2023-04-19 17:38:27,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=780, RunningAvgSamplesPerSec=12.320793923911, CurrSamplesPerSec=12.407547599586076, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:38:54,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=14, lr=[0.0007896038782017308], mom=[(0.9, 0.95)] [2023-04-19 17:38:54,089] [INFO] [timer.py:199:stop] epoch=0/micro_step=1580/global_step=790, RunningAvgSamplesPerSec=12.319644966599517, CurrSamplesPerSec=12.290209522225812, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:39:19,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=14, lr=[0.0007845763708957448], mom=[(0.9, 0.95)] [2023-04-19 17:39:19,990] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=800, RunningAvgSamplesPerSec=12.320236636211249, CurrSamplesPerSec=12.744466835576597, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:39:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=14, lr=[0.0007795059403946033], mom=[(0.9, 0.95)] [2023-04-19 17:39:45,825] [INFO] [timer.py:199:stop] epoch=0/micro_step=1620/global_step=810, RunningAvgSamplesPerSec=12.321208267862733, CurrSamplesPerSec=12.790884182215338, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:39:58,273] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. 
Reducing hysteresis to 1 [2023-04-19 17:40:00,745] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:40:10,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=16, lr=[0.0007754192050125431], mom=[(0.9, 0.95)] [2023-04-19 17:40:10,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=820, RunningAvgSamplesPerSec=12.327498662124329, CurrSamplesPerSec=12.849750726700075, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:40:36,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=16, lr=[0.000770273444289497], mom=[(0.9, 0.95)] [2023-04-19 17:40:36,307] [INFO] [timer.py:199:stop] epoch=0/micro_step=1660/global_step=830, RunningAvgSamplesPerSec=12.329883167590443, CurrSamplesPerSec=12.826796704158907, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:41:01,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=16, lr=[0.0007650869177089128], mom=[(0.9, 0.95)] [2023-04-19 17:41:01,234] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=840, RunningAvgSamplesPerSec=12.335855108865013, CurrSamplesPerSec=12.83786436378276, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:41:26,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=16, lr=[0.0007598604075644574], mom=[(0.9, 0.95)] [2023-04-19 17:41:26,909] [INFO] [timer.py:199:stop] epoch=0/micro_step=1700/global_step=850, RunningAvgSamplesPerSec=12.337496481370808, CurrSamplesPerSec=12.849296794964276, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:41:51,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=16, lr=[0.0007545947021805939], mom=[(0.9, 0.95)] [2023-04-19 17:41:51,839] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=860, RunningAvgSamplesPerSec=12.343223968422315, CurrSamplesPerSec=12.837561070790468, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:42:17,574] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=870, skipped=16, lr=[0.0007492905957936784], mom=[(0.9, 0.95)] [2023-04-19 17:42:17,575] [INFO] [timer.py:199:stop] epoch=0/micro_step=1740/global_step=870, RunningAvgSamplesPerSec=12.344409548735772, CurrSamplesPerSec=12.804170059275858, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:42:43,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=16, lr=[0.0007439488884321635], mom=[(0.9, 0.95)] [2023-04-19 17:42:43,355] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=880, RunningAvgSamplesPerSec=12.345321702239133, CurrSamplesPerSec=11.084045608043938, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:43:08,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=16, lr=[0.0007385703857959276], mom=[(0.9, 0.95)] [2023-04-19 17:43:08,447] [INFO] [timer.py:199:stop] epoch=0/micro_step=1780/global_step=890, RunningAvgSamplesPerSec=12.349907530652988, CurrSamplesPerSec=12.766375184074995, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:43:34,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=16, lr=[0.0007331558991347511], mom=[(0.9, 0.95)] [2023-04-19 17:43:34,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=900, RunningAvgSamplesPerSec=12.350024217868127, CurrSamplesPerSec=12.711335014884598, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:43:59,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=16, lr=[0.0007277062451259528], mom=[(0.9, 0.95)] [2023-04-19 17:43:59,716] [INFO] [timer.py:199:stop] epoch=0/micro_step=1820/global_step=910, RunningAvgSamplesPerSec=12.353091868169075, CurrSamplesPerSec=12.581144742324476, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:44:18,143] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. 
Reducing hysteresis to 1 [2023-04-19 17:44:20,668] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:44:25,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=18, lr=[0.0007233217536252489], mom=[(0.9, 0.95)] [2023-04-19 17:44:25,811] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=920, RunningAvgSamplesPerSec=12.35223511923478, CurrSamplesPerSec=12.48053108339018, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:44:51,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=18, lr=[0.0007178108732699562], mom=[(0.9, 0.95)] [2023-04-19 17:44:51,480] [INFO] [timer.py:199:stop] epoch=0/micro_step=1860/global_step=930, RunningAvgSamplesPerSec=12.3535888911587, CurrSamplesPerSec=12.434895936355083, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:45:18,016] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=18, lr=[0.000712267140086472], mom=[(0.9, 0.95)] [2023-04-19 17:45:18,016] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=940, RunningAvgSamplesPerSec=12.350506541077507, CurrSamplesPerSec=12.352032441613526, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:45:44,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=18, lr=[0.0007066913902466141], mom=[(0.9, 0.95)] [2023-04-19 17:45:44,301] [INFO] [timer.py:199:stop] epoch=0/micro_step=1900/global_step=950, RunningAvgSamplesPerSec=12.34875418553038, CurrSamplesPerSec=10.873309381187074, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:46:10,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=18, lr=[0.0007010844647513335], mom=[(0.9, 0.95)] [2023-04-19 17:46:10,499] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=960, RunningAvgSamplesPerSec=12.347471208831822, CurrSamplesPerSec=12.461529691197287, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:46:37,017] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=970, skipped=18, lr=[0.000695447209303864], mom=[(0.9, 0.95)] [2023-04-19 17:46:37,018] [INFO] [timer.py:199:stop] epoch=0/micro_step=1940/global_step=970, RunningAvgSamplesPerSec=12.344635758493899, CurrSamplesPerSec=10.841980905932228, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:47:02,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=18, lr=[0.0006897804741821649], mom=[(0.9, 0.95)] [2023-04-19 17:47:02,810] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=980, RunningAvgSamplesPerSec=12.345394722587116, CurrSamplesPerSec=12.393337697206483, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:47:29,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=18, lr=[0.0006840851141106694], mom=[(0.9, 0.95)] [2023-04-19 17:47:29,294] [INFO] [timer.py:199:stop] epoch=0/micro_step=1980/global_step=990, RunningAvgSamplesPerSec=12.34280801200078, CurrSamplesPerSec=12.368272990522472, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:47:55,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=18, lr=[0.0006783619881313676], mom=[(0.9, 0.95)] [2023-04-19 17:47:55,162] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=1000, RunningAvgSamplesPerSec=12.343210472186934, CurrSamplesPerSec=12.339334311891653, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:48:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=18, lr=[0.0006726119594742333], mom=[(0.9, 0.95)] [2023-04-19 17:48:21,597] [INFO] [timer.py:199:stop] epoch=0/micro_step=2020/global_step=1010, RunningAvgSamplesPerSec=12.340929143415744, CurrSamplesPerSec=12.375618457152315, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:48:44,790] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. 
Reducing hysteresis to 1 [2023-04-19 17:48:47,330] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:48:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=20, lr=[0.0006679931493048548], mom=[(0.9, 0.95)] [2023-04-19 17:48:47,331] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=1020, RunningAvgSamplesPerSec=12.341967169670106, CurrSamplesPerSec=12.609502682829376, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:49:13,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=20, lr=[0.000662196884036101], mom=[(0.9, 0.95)] [2023-04-19 17:49:13,880] [INFO] [timer.py:199:stop] epoch=0/micro_step=2060/global_step=1030, RunningAvgSamplesPerSec=12.339214394571659, CurrSamplesPerSec=12.447448749159538, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:49:40,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=20, lr=[0.0006563761543029039], mom=[(0.9, 0.95)] [2023-04-19 17:49:40,042] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=1040, RunningAvgSamplesPerSec=12.338287948072862, CurrSamplesPerSec=10.883461229547047, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:50:06,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=20, lr=[0.000650531838056998], mom=[(0.9, 0.95)] [2023-04-19 17:50:06,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=2100/global_step=1050, RunningAvgSamplesPerSec=12.337358766530718, CurrSamplesPerSec=12.446037095975667, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:50:32,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=20, lr=[0.0006446648168077156], mom=[(0.9, 0.95)] [2023-04-19 17:50:32,592] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=1060, RunningAvgSamplesPerSec=12.335472446032648, CurrSamplesPerSec=12.43088691461683, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:50:58,360] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=20, lr=[0.000638775975489028], mom=[(0.9, 0.95)] [2023-04-19 17:50:58,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2140/global_step=1070, RunningAvgSamplesPerSec=12.336356910225373, CurrSamplesPerSec=12.431173598785648, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:51:24,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=20, lr=[0.0006328662023260695], mom=[(0.9, 0.95)] [2023-04-19 17:51:24,880] [INFO] [timer.py:199:stop] epoch=0/micro_step=2160/global_step=1080, RunningAvgSamplesPerSec=12.333914084704624, CurrSamplesPerSec=12.393311376739035, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:51:50,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=20, lr=[0.0006269363887011636], mom=[(0.9, 0.95)] [2023-04-19 17:51:50,709] [INFO] [timer.py:199:stop] epoch=0/micro_step=2180/global_step=1090, RunningAvgSamplesPerSec=12.334540626156128, CurrSamplesPerSec=12.313492024901896, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:52:17,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=20, lr=[0.0006209874290193754], mom=[(0.9, 0.95)] [2023-04-19 17:52:17,372] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=1100, RunningAvgSamplesPerSec=12.33154887382222, CurrSamplesPerSec=12.279735312421261, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:52:43,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=20, lr=[0.0006150202205736057], mom=[(0.9, 0.95)] [2023-04-19 17:52:43,397] [INFO] [timer.py:199:stop] epoch=0/micro_step=2220/global_step=1110, RunningAvgSamplesPerSec=12.331350008579633, CurrSamplesPerSec=12.308929838645925, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:53:10,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=20, lr=[0.0006090356634092513], mom=[(0.9, 0.95)] [2023-04-19 17:53:10,047] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=1120, 
RunningAvgSamplesPerSec=12.328499180436237, CurrSamplesPerSec=12.302516879811852, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:53:12,580] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:53:15,130] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:53:36,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=22, lr=[0.0006042361331048955], mom=[(0.9, 0.95)] [2023-04-19 17:53:36,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2260/global_step=1130, RunningAvgSamplesPerSec=12.327112608764459, CurrSamplesPerSec=12.31348072818655, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:54:02,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=22, lr=[0.0005982226246272145], mom=[(0.9, 0.95)] [2023-04-19 17:54:02,771] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=1140, RunningAvgSamplesPerSec=12.3253524395009, CurrSamplesPerSec=12.32361769217392, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:54:29,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=22, lr=[0.0005921943010442869], mom=[(0.9, 0.95)] [2023-04-19 17:54:29,537] [INFO] [timer.py:199:stop] epoch=0/micro_step=2300/global_step=1150, RunningAvgSamplesPerSec=12.322150881860248, CurrSamplesPerSec=12.302494326659394, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:54:55,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=22, lr=[0.0005861520716196217], mom=[(0.9, 0.95)] [2023-04-19 17:54:55,579] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=1160, RunningAvgSamplesPerSec=12.321969112440607, CurrSamplesPerSec=12.287758879731456, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:55:22,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=22, 
lr=[0.0005800968477141724], mom=[(0.9, 0.95)] [2023-04-19 17:55:22,344] [INFO] [timer.py:199:stop] epoch=0/micro_step=2340/global_step=1170, RunningAvgSamplesPerSec=12.318860355457225, CurrSamplesPerSec=12.258176434330014, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:55:48,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=22, lr=[0.000574029542648875], mom=[(0.9, 0.95)] [2023-04-19 17:55:48,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=1180, RunningAvgSamplesPerSec=12.318809050479222, CurrSamplesPerSec=12.348782177117398, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:56:14,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=22, lr=[0.0005679510715668897], mom=[(0.9, 0.95)] [2023-04-19 17:56:14,939] [INFO] [timer.py:199:stop] epoch=0/micro_step=2380/global_step=1190, RunningAvgSamplesPerSec=12.316525641512154, CurrSamplesPerSec=12.347010026434933, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:56:41,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=22, lr=[0.0005618623512955685], mom=[(0.9, 0.95)] [2023-04-19 17:56:41,308] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=1200, RunningAvgSamplesPerSec=12.315105038960837, CurrSamplesPerSec=10.827393883169135, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:57:07,704] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=22, lr=[0.0005557643002081674], mom=[(0.9, 0.95)] [2023-04-19 17:57:07,705] [INFO] [timer.py:199:stop] epoch=0/micro_step=2420/global_step=1210, RunningAvgSamplesPerSec=12.313597870894732, CurrSamplesPerSec=12.29607905829239, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:57:34,436] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=22, lr=[0.000549657838085328], mom=[(0.9, 0.95)] [2023-04-19 17:57:34,437] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=1220, RunningAvgSamplesPerSec=12.3108149927377, 
CurrSamplesPerSec=10.727947607851789, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:57:42,160] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 17:57:44,721] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 17:58:00,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=24, lr=[0.000544767231347586], mom=[(0.9, 0.95)] [2023-04-19 17:58:00,351] [INFO] [timer.py:199:stop] epoch=0/micro_step=2460/global_step=1230, RunningAvgSamplesPerSec=12.311230504672189, CurrSamplesPerSec=12.343848682564769, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:58:27,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=24, lr=[0.0005386479511690275], mom=[(0.9, 0.95)] [2023-04-19 17:58:27,106] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=1240, RunningAvgSamplesPerSec=12.3084254813281, CurrSamplesPerSec=12.358059046593949, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:58:53,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=24, lr=[0.0005325228416465036], mom=[(0.9, 0.95)] [2023-04-19 17:58:53,096] [INFO] [timer.py:199:stop] epoch=0/micro_step=2500/global_step=1250, RunningAvgSamplesPerSec=12.308565558430116, CurrSamplesPerSec=12.362821707537037, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:59:19,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=24, lr=[0.0005263928266419306], mom=[(0.9, 0.95)] [2023-04-19 17:59:19,726] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=1260, RunningAvgSamplesPerSec=12.306294872236686, CurrSamplesPerSec=12.285429537813988, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 17:59:45,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=24, lr=[0.0005202588307571282], mom=[(0.9, 0.95)] 
[2023-04-19 17:59:45,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=2540/global_step=1270, RunningAvgSamplesPerSec=12.306418820802854, CurrSamplesPerSec=12.375375407494394, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 18:00:12,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=24, lr=[0.0005141217791943596], mom=[(0.9, 0.95)] [2023-04-19 18:00:12,350] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=1280, RunningAvgSamplesPerSec=12.304221787202156, CurrSamplesPerSec=12.32757139885218, MemAllocated=1.95GB, MaxMemAllocated=13.61GB [2023-04-19 18:00:38,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=24, lr=[0.0005079825976167822], mom=[(0.9, 0.95)] [2023-04-19 18:00:38,707] [INFO] [timer.py:199:stop] epoch=0/micro_step=2580/global_step=1290, RunningAvgSamplesPerSec=12.303041637793852, CurrSamplesPerSec=10.799450282422233, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:01:05,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=24, lr=[0.000501842212008827], mom=[(0.9, 0.95)] [2023-04-19 18:01:05,116] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=1300, RunningAvgSamplesPerSec=12.301692628837143, CurrSamplesPerSec=12.321747554864698, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:01:31,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=24, lr=[0.0004957015485365313], mom=[(0.9, 0.95)] [2023-04-19 18:01:31,886] [INFO] [timer.py:199:stop] epoch=0/micro_step=2620/global_step=1310, RunningAvgSamplesPerSec=12.299057767730432, CurrSamplesPerSec=10.716507333570151, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:01:57,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=24, lr=[0.0004895615334078436], mom=[(0.9, 0.95)] [2023-04-19 18:01:57,913] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=1320, RunningAvgSamplesPerSec=12.299131376088003, CurrSamplesPerSec=12.290784630892936, MemAllocated=1.95GB, 
MaxMemAllocated=13.62GB [2023-04-19 18:02:11,118] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:02:13,676] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:02:24,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=26, lr=[0.0004846506104651698], mom=[(0.9, 0.95)] [2023-04-19 18:02:24,489] [INFO] [timer.py:199:stop] epoch=0/micro_step=2660/global_step=1330, RunningAvgSamplesPerSec=12.297245436028547, CurrSamplesPerSec=12.321097156185356, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:02:50,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=26, lr=[0.00047851409599768043], mom=[(0.9, 0.95)] [2023-04-19 18:02:50,545] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=1340, RunningAvgSamplesPerSec=12.297227368370496, CurrSamplesPerSec=12.2613119643179, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:03:17,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=26, lr=[0.0004723808222899481], mom=[(0.9, 0.95)] [2023-04-19 18:03:17,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=2700/global_step=1350, RunningAvgSamplesPerSec=12.294722758257807, CurrSamplesPerSec=12.307697273572408, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:03:43,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=26, lr=[0.0004662517144353085], mom=[(0.9, 0.95)] [2023-04-19 18:03:43,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=1360, RunningAvgSamplesPerSec=12.294736732905998, CurrSamplesPerSec=12.305249806208746, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:04:10,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=26, lr=[0.0004601276968987546], mom=[(0.9, 0.95)] [2023-04-19 18:04:10,014] [INFO] [timer.py:199:stop] 
epoch=0/micro_step=2740/global_step=1370, RunningAvgSamplesPerSec=12.292682779748246, CurrSamplesPerSec=12.342967791868034, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:04:36,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=26, lr=[0.0004540096933774962], mom=[(0.9, 0.95)] [2023-04-19 18:04:36,396] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=1380, RunningAvgSamplesPerSec=12.291579956405803, CurrSamplesPerSec=12.317193944852322, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:05:02,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=26, lr=[0.00044789862666163807], mom=[(0.9, 0.95)] [2023-04-19 18:05:02,819] [INFO] [timer.py:199:stop] epoch=0/micro_step=2780/global_step=1390, RunningAvgSamplesPerSec=12.290354577889605, CurrSamplesPerSec=12.290194891983273, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:05:29,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=26, lr=[0.0004417954184949932], mom=[(0.9, 0.95)] [2023-04-19 18:05:29,596] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=1400, RunningAvgSamplesPerSec=12.287955655202952, CurrSamplesPerSec=12.235999243692683, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:05:55,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=26, lr=[0.0004357009894360553], mom=[(0.9, 0.95)] [2023-04-19 18:05:55,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=2820/global_step=1410, RunningAvgSamplesPerSec=12.287937989215635, CurrSamplesPerSec=12.295943882201728, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:06:22,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=26, lr=[0.0004296162587191479], mom=[(0.9, 0.95)] [2023-04-19 18:06:22,416] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=1420, RunningAvgSamplesPerSec=12.285696192675177, CurrSamplesPerSec=12.326868307105308, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:06:40,557] [INFO] 
[loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:06:43,112] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:06:48,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=28, lr=[0.0004247560737470216], mom=[(0.9, 0.95)] [2023-04-19 18:06:48,338] [INFO] [timer.py:199:stop] epoch=0/micro_step=2860/global_step=1430, RunningAvgSamplesPerSec=12.286200975609571, CurrSamplesPerSec=12.302398476683939, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:07:15,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=28, lr=[0.00041869111175856633], mom=[(0.9, 0.95)] [2023-04-19 18:07:15,114] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=1440, RunningAvgSamplesPerSec=12.283901533897822, CurrSamplesPerSec=12.290463868931733, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:07:41,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=28, lr=[0.00041263841374433654], mom=[(0.9, 0.95)] [2023-04-19 18:07:41,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=2900/global_step=1450, RunningAvgSamplesPerSec=12.284042842085487, CurrSamplesPerSec=12.325609509044535, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:08:07,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=28, lr=[0.00040659889264428324], mom=[(0.9, 0.95)] [2023-04-19 18:08:07,779] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=1460, RunningAvgSamplesPerSec=12.282261851329789, CurrSamplesPerSec=12.314158567797463, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:08:34,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=28, lr=[0.0004005734594108583], mom=[(0.9, 0.95)] [2023-04-19 18:08:34,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=2940/global_step=1470, 
RunningAvgSamplesPerSec=12.28124751646962, CurrSamplesPerSec=12.284246646807013, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:09:00,567] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=28, lr=[0.00039456302287161396], mom=[(0.9, 0.95)] [2023-04-19 18:09:00,567] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=1480, RunningAvgSamplesPerSec=12.280273099429602, CurrSamplesPerSec=12.304348471794587, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:09:27,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=28, lr=[0.0003885684895921226], mom=[(0.9, 0.95)] [2023-04-19 18:09:27,304] [INFO] [timer.py:199:stop] epoch=0/micro_step=2980/global_step=1490, RunningAvgSamplesPerSec=12.278213162378892, CurrSamplesPerSec=12.288593653679008, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:09:53,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=28, lr=[0.0003825907637392375], mom=[(0.9, 0.95)] [2023-04-19 18:09:53,334] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=1500, RunningAvgSamplesPerSec=12.27840720321145, CurrSamplesPerSec=12.275688721970516, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:10:20,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=28, lr=[0.0003766307469447161], mom=[(0.9, 0.95)] [2023-04-19 18:10:20,080] [INFO] [timer.py:199:stop] epoch=0/micro_step=3020/global_step=1510, RunningAvgSamplesPerSec=12.276358165025238, CurrSamplesPerSec=12.341332347628354, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:10:46,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=28, lr=[0.00037068933816922456], mom=[(0.9, 0.95)] [2023-04-19 18:10:46,100] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=1520, RunningAvgSamplesPerSec=12.276591206833402, CurrSamplesPerSec=12.318812821547843, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:11:10,214] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] 
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:11:12,767] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:11:12,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=30, lr=[0.000365950211235768], mom=[(0.9, 0.95)] [2023-04-19 18:11:12,768] [INFO] [timer.py:199:stop] epoch=0/micro_step=3060/global_step=1530, RunningAvgSamplesPerSec=12.274823812996173, CurrSamplesPerSec=12.545723730485774, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:11:39,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=30, lr=[0.00036004455323017474], mom=[(0.9, 0.95)] [2023-04-19 18:11:39,159] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=1540, RunningAvgSamplesPerSec=12.273924370839671, CurrSamplesPerSec=10.841243528321511, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:12:05,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=30, lr=[0.00035416000497074865], mom=[(0.9, 0.95)] [2023-04-19 18:12:05,393] [INFO] [timer.py:199:stop] epoch=0/micro_step=3100/global_step=1550, RunningAvgSamplesPerSec=12.273519460152688, CurrSamplesPerSec=12.274678332960356, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:12:31,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=30, lr=[0.0003482974540350933], mom=[(0.9, 0.95)] [2023-04-19 18:12:31,990] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=1560, RunningAvgSamplesPerSec=12.272017635829767, CurrSamplesPerSec=10.848107117940044, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:12:57,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=30, lr=[0.0003424577846829144], mom=[(0.9, 0.95)] [2023-04-19 18:12:57,722] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=1570, RunningAvgSamplesPerSec=12.273125881399084, 
CurrSamplesPerSec=12.488360282731884, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:13:24,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=30, lr=[0.00033664187772264466], mom=[(0.9, 0.95)] [2023-04-19 18:13:24,190] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=1580, RunningAvgSamplesPerSec=12.27202700089158, CurrSamplesPerSec=12.450442790397208, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:13:49,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=30, lr=[0.00033085061037859], mom=[(0.9, 0.95)] [2023-04-19 18:13:49,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=3180/global_step=1590, RunningAvgSamplesPerSec=12.27304390587419, CurrSamplesPerSec=12.406213786289506, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:14:16,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=30, lr=[0.00032508485615861607], mom=[(0.9, 0.95)] [2023-04-19 18:14:16,439] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=1600, RunningAvgSamplesPerSec=12.271896058667723, CurrSamplesPerSec=12.42768362926357, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:14:42,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=30, lr=[0.0003193454847223962], mom=[(0.9, 0.95)] [2023-04-19 18:14:42,174] [INFO] [timer.py:199:stop] epoch=0/micro_step=3220/global_step=1610, RunningAvgSamplesPerSec=12.272971178673968, CurrSamplesPerSec=12.442486870831228, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:15:08,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=30, lr=[0.00031363336175023725], mom=[(0.9, 0.95)] [2023-04-19 18:15:08,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=1620, RunningAvgSamplesPerSec=12.27222947378135, CurrSamplesPerSec=12.439632699307893, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:15:34,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=30, lr=[0.0003079493488125092], mom=[(0.9, 
0.95)] [2023-04-19 18:15:34,664] [INFO] [timer.py:199:stop] epoch=0/micro_step=3260/global_step=1630, RunningAvgSamplesPerSec=12.272130563979738, CurrSamplesPerSec=12.395891314029205, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:15:37,182] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:15:39,970] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:16:00,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=32, lr=[0.0003034229539589651], mom=[(0.9, 0.95)] [2023-04-19 18:16:00,635] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=1640, RunningAvgSamplesPerSec=12.272505412199054, CurrSamplesPerSec=12.391153478786137, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:16:27,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=32, lr=[0.00029779169662424564], mom=[(0.9, 0.95)] [2023-04-19 18:16:27,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=3300/global_step=1650, RunningAvgSamplesPerSec=12.271236433244209, CurrSamplesPerSec=10.890016115419574, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:16:52,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=32, lr=[0.00029219093875243143], mom=[(0.9, 0.95)] [2023-04-19 18:16:52,971] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=1660, RunningAvgSamplesPerSec=12.27212462539587, CurrSamplesPerSec=12.348976463033916, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:17:19,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=32, lr=[0.0002866215251164824], mom=[(0.9, 0.95)] [2023-04-19 18:17:19,515] [INFO] [timer.py:199:stop] epoch=0/micro_step=3340/global_step=1670, RunningAvgSamplesPerSec=12.270876288262468, CurrSamplesPerSec=12.430595638622705, MemAllocated=1.95GB, 
MaxMemAllocated=13.62GB [2023-04-19 18:17:45,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=32, lr=[0.0002810842957616477], mom=[(0.9, 0.95)] [2023-04-19 18:17:45,355] [INFO] [timer.py:199:stop] epoch=0/micro_step=3360/global_step=1680, RunningAvgSamplesPerSec=12.271617283094267, CurrSamplesPerSec=12.406374333326339, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:18:11,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=32, lr=[0.00027558008587876047], mom=[(0.9, 0.95)] [2023-04-19 18:18:11,907] [INFO] [timer.py:199:stop] epoch=0/micro_step=3380/global_step=1690, RunningAvgSamplesPerSec=12.270363834350324, CurrSamplesPerSec=12.39498237596127, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:18:37,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=32, lr=[0.00027010972567826367], mom=[(0.9, 0.95)] [2023-04-19 18:18:37,649] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=1700, RunningAvgSamplesPerSec=12.271373139771677, CurrSamplesPerSec=12.584368640670158, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:19:03,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=32, lr=[0.000264674040264988], mom=[(0.9, 0.95)] [2023-04-19 18:19:03,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=3420/global_step=1710, RunningAvgSamplesPerSec=12.271917693197654, CurrSamplesPerSec=12.828423573004878, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:19:28,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=32, lr=[0.00025927384951370127], mom=[(0.9, 0.95)] [2023-04-19 18:19:28,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=1720, RunningAvgSamplesPerSec=12.273993814762228, CurrSamplesPerSec=12.802104838251223, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:19:54,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=32, lr=[0.0002539099679454425], mom=[(0.9, 0.95)] [2023-04-19 18:19:54,047] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=3460/global_step=1730, RunningAvgSamplesPerSec=12.276580675964897, CurrSamplesPerSec=12.900282740307876, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:20:01,466] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:20:03,937] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:20:19,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=34, lr=[0.00024964554916762446], mom=[(0.9, 0.95)] [2023-04-19 18:20:19,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=1740, RunningAvgSamplesPerSec=12.277943442484174, CurrSamplesPerSec=12.880564950258988, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:20:44,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=34, lr=[0.00024434905916265827], mom=[(0.9, 0.95)] [2023-04-19 18:20:44,555] [INFO] [timer.py:199:stop] epoch=0/micro_step=3500/global_step=1750, RunningAvgSamplesPerSec=12.281099911613431, CurrSamplesPerSec=12.852529143234488, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:21:10,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=34, lr=[0.00023909112947522872], mom=[(0.9, 0.95)] [2023-04-19 18:21:10,262] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=1760, RunningAvgSamplesPerSec=12.282106025091206, CurrSamplesPerSec=12.809001687377384, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:21:35,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=34, lr=[0.00023387255316886947], mom=[(0.9, 0.95)] [2023-04-19 18:21:35,231] [INFO] [timer.py:199:stop] epoch=0/micro_step=3540/global_step=1770, RunningAvgSamplesPerSec=12.285070635156188, CurrSamplesPerSec=12.802605510726918, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:22:00,969] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=34, lr=[0.00022869411737136774], mom=[(0.9, 0.95)] [2023-04-19 18:22:00,969] [INFO] [timer.py:199:stop] epoch=0/micro_step=3560/global_step=1780, RunningAvgSamplesPerSec=12.285961243756674, CurrSamplesPerSec=12.798800161308426, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:22:26,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=34, lr=[0.0002235566031560417], mom=[(0.9, 0.95)] [2023-04-19 18:22:26,395] [INFO] [timer.py:199:stop] epoch=0/micro_step=3580/global_step=1790, RunningAvgSamplesPerSec=12.287666145186579, CurrSamplesPerSec=11.188044943992558, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:22:51,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=34, lr=[0.00021846078542393004], mom=[(0.9, 0.95)] [2023-04-19 18:22:51,936] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=1800, RunningAvgSamplesPerSec=12.289049568018969, CurrSamplesPerSec=12.733015954391846, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:23:17,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=34, lr=[0.00021340743278691076], mom=[(0.9, 0.95)] [2023-04-19 18:23:17,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=3620/global_step=1810, RunningAvgSamplesPerSec=12.290053744673086, CurrSamplesPerSec=12.60355859829003, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:23:43,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=34, lr=[0.00020839730745177148], mom=[(0.9, 0.95)] [2023-04-19 18:23:43,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=1820, RunningAvgSamplesPerSec=12.290727365115005, CurrSamplesPerSec=12.491374039968793, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:24:09,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=34, lr=[0.00020343116510524367], mom=[(0.9, 0.95)] [2023-04-19 18:24:09,756] [INFO] [timer.py:199:stop] epoch=0/micro_step=3660/global_step=1830, 
RunningAvgSamplesPerSec=12.290024273445455, CurrSamplesPerSec=12.415638048515792, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:24:22,647] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:24:25,197] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:24:35,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=36, lr=[0.00019949042256902537], mom=[(0.9, 0.95)] [2023-04-19 18:24:35,610] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=1840, RunningAvgSamplesPerSec=12.290559956150119, CurrSamplesPerSec=12.350383229989983, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:25:02,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=36, lr=[0.00019460533268455865], mom=[(0.9, 0.95)] [2023-04-19 18:25:02,173] [INFO] [timer.py:199:stop] epoch=0/micro_step=3700/global_step=1850, RunningAvgSamplesPerSec=12.289282459116265, CurrSamplesPerSec=12.435905221269186, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:25:27,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=36, lr=[0.00018976630605848356], mom=[(0.9, 0.95)] [2023-04-19 18:25:27,920] [INFO] [timer.py:199:stop] epoch=0/micro_step=3720/global_step=1860, RunningAvgSamplesPerSec=12.290090474590235, CurrSamplesPerSec=12.435068747489805, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:25:54,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=36, lr=[0.00018497407257038722], mom=[(0.9, 0.95)] [2023-04-19 18:25:54,351] [INFO] [timer.py:199:stop] epoch=0/micro_step=3740/global_step=1870, RunningAvgSamplesPerSec=12.28916153999188, CurrSamplesPerSec=12.468483660129504, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:26:20,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=36, 
lr=[0.00018022935504195952], mom=[(0.9, 0.95)] [2023-04-19 18:26:20,422] [INFO] [timer.py:199:stop] epoch=0/micro_step=3760/global_step=1880, RunningAvgSamplesPerSec=12.289146106288744, CurrSamplesPerSec=10.895669559644977, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:26:46,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=36, lr=[0.00017553286912796773], mom=[(0.9, 0.95)] [2023-04-19 18:26:46,605] [INFO] [timer.py:199:stop] epoch=0/micro_step=3780/global_step=1890, RunningAvgSamplesPerSec=12.288853181453334, CurrSamplesPerSec=12.376522274843671, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:27:12,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=36, lr=[0.00017088532320831245], mom=[(0.9, 0.95)] [2023-04-19 18:27:12,954] [INFO] [timer.py:199:stop] epoch=0/micro_step=3800/global_step=1900, RunningAvgSamplesPerSec=12.288151161040204, CurrSamplesPerSec=11.314906634422599, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:27:38,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=36, lr=[0.00016628741828118255], mom=[(0.9, 0.95)] [2023-04-19 18:27:38,340] [INFO] [timer.py:199:stop] epoch=0/micro_step=3820/global_step=1910, RunningAvgSamplesPerSec=12.289834917827946, CurrSamplesPerSec=12.758622957963134, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:28:04,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=36, lr=[0.0001617398478573211], mom=[(0.9, 0.95)] [2023-04-19 18:28:04,094] [INFO] [timer.py:199:stop] epoch=0/micro_step=3840/global_step=1920, RunningAvgSamplesPerSec=12.290598323679253, CurrSamplesPerSec=12.841313334019006, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:28:29,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=36, lr=[0.0001572432978554223], mom=[(0.9, 0.95)] [2023-04-19 18:28:29,058] [INFO] [timer.py:199:stop] epoch=0/micro_step=3860/global_step=1930, RunningAvgSamplesPerSec=12.293287579333615, 
CurrSamplesPerSec=12.792861649178567, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:28:47,200] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:28:49,671] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:28:54,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=38, lr=[0.0001536832485848859], mom=[(0.9, 0.95)] [2023-04-19 18:28:54,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=3880/global_step=1940, RunningAvgSamplesPerSec=12.294367966683168, CurrSamplesPerSec=12.86009034984407, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:29:19,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=38, lr=[0.00014928023922823442], mom=[(0.9, 0.95)] [2023-04-19 18:29:19,558] [INFO] [timer.py:199:stop] epoch=0/micro_step=3900/global_step=1950, RunningAvgSamplesPerSec=12.29720127164094, CurrSamplesPerSec=12.879749164030441, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:29:45,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=38, lr=[0.00014493012960000785], mom=[(0.9, 0.95)] [2023-04-19 18:29:45,079] [INFO] [timer.py:199:stop] epoch=0/micro_step=3920/global_step=1960, RunningAvgSamplesPerSec=12.298476526249685, CurrSamplesPerSec=12.867855376739639, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:30:10,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=38, lr=[0.0001406335758355134], mom=[(0.9, 0.95)] [2023-04-19 18:30:10,380] [INFO] [timer.py:199:stop] epoch=0/micro_step=3940/global_step=1970, RunningAvgSamplesPerSec=12.300264143525318, CurrSamplesPerSec=12.820006808415691, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:30:35,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=38, lr=[0.00013639122599212533], mom=[(0.9, 0.95)] 
[2023-04-19 18:30:35,733] [INFO] [timer.py:199:stop] epoch=0/micro_step=3960/global_step=1980, RunningAvgSamplesPerSec=12.301912515264485, CurrSamplesPerSec=12.802970659982275, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:31:01,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=38, lr=[0.00013220371995153736], mom=[(0.9, 0.95)] [2023-04-19 18:31:01,382] [INFO] [timer.py:199:stop] epoch=0/micro_step=3980/global_step=1990, RunningAvgSamplesPerSec=12.302840564663645, CurrSamplesPerSec=11.593263562391549, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:31:26,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=38, lr=[0.00012807168932324857], mom=[(0.9, 0.95)] [2023-04-19 18:31:26,506] [INFO] [timer.py:199:stop] epoch=0/micro_step=4000/global_step=2000, RunningAvgSamplesPerSec=12.304999928758503, CurrSamplesPerSec=12.715702879906292, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:31:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=38, lr=[0.0001239957573492957], mom=[(0.9, 0.95)] [2023-04-19 18:31:52,481] [INFO] [timer.py:199:stop] epoch=0/micro_step=4020/global_step=2010, RunningAvgSamplesPerSec=12.305134776463118, CurrSamplesPerSec=12.599764184596898, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:32:17,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=38, lr=[0.00011997653881024884], mom=[(0.9, 0.95)] [2023-04-19 18:32:17,971] [INFO] [timer.py:199:stop] epoch=0/micro_step=4040/global_step=2020, RunningAvgSamplesPerSec=12.306405867559443, CurrSamplesPerSec=12.496697285427834, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:32:44,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=38, lr=[0.0001160146399324833], mom=[(0.9, 0.95)] [2023-04-19 18:32:44,383] [INFO] [timer.py:199:stop] epoch=0/micro_step=4060/global_step=2030, RunningAvgSamplesPerSec=12.305514093554024, CurrSamplesPerSec=12.39382407379021, MemAllocated=1.95GB, 
MaxMemAllocated=13.62GB [2023-04-19 18:33:07,621] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:33:10,518] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:33:10,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=40, lr=[0.00011288679285345288], mom=[(0.9, 0.95)] [2023-04-19 18:33:10,518] [INFO] [timer.py:199:stop] epoch=0/micro_step=4080/global_step=2040, RunningAvgSamplesPerSec=12.305270434756762, CurrSamplesPerSec=11.058220762964236, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:33:36,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=40, lr=[0.0001090295694020207], mom=[(0.9, 0.95)] [2023-04-19 18:33:36,786] [INFO] [timer.py:199:stop] epoch=0/micro_step=4100/global_step=2050, RunningAvgSamplesPerSec=12.304723993392827, CurrSamplesPerSec=12.338405292495112, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:34:02,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=40, lr=[0.00010523131676408154], mom=[(0.9, 0.95)] [2023-04-19 18:34:02,975] [INFO] [timer.py:199:stop] epoch=0/micro_step=4120/global_step=2060, RunningAvgSamplesPerSec=12.304364466002099, CurrSamplesPerSec=12.451194702243312, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:34:29,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=40, lr=[0.00010149260783730319], mom=[(0.9, 0.95)] [2023-04-19 18:34:29,153] [INFO] [timer.py:199:stop] epoch=0/micro_step=4140/global_step=2070, RunningAvgSamplesPerSec=12.304034186898377, CurrSamplesPerSec=12.411747026891463, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:34:55,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=40, lr=[9.781400653826244e-05], mom=[(0.9, 0.95)] [2023-04-19 18:34:55,550] [INFO] [timer.py:199:stop] 
epoch=0/micro_step=4160/global_step=2080, RunningAvgSamplesPerSec=12.303206574085898, CurrSamplesPerSec=12.411290230109326, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:35:21,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=40, lr=[9.419606771738853e-05], mom=[(0.9, 0.95)] [2023-04-19 18:35:21,383] [INFO] [timer.py:199:stop] epoch=0/micro_step=4180/global_step=2090, RunningAvgSamplesPerSec=12.30366604428486, CurrSamplesPerSec=12.406835355771703, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:35:47,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=40, lr=[9.063933707527306e-05], mom=[(0.9, 0.95)] [2023-04-19 18:35:47,968] [INFO] [timer.py:199:stop] epoch=0/micro_step=4200/global_step=2100, RunningAvgSamplesPerSec=12.302425094480283, CurrSamplesPerSec=12.383200007011915, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:36:13,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=40, lr=[8.714435108036234e-05], mom=[(0.9, 0.95)] [2023-04-19 18:36:13,794] [INFO] [timer.py:199:stop] epoch=0/micro_step=4220/global_step=2110, RunningAvgSamplesPerSec=12.302901308765746, CurrSamplesPerSec=12.434363708237937, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:36:40,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=40, lr=[8.371163688803967e-05], mom=[(0.9, 0.95)] [2023-04-19 18:36:40,308] [INFO] [timer.py:199:stop] epoch=0/micro_step=4240/global_step=2120, RunningAvgSamplesPerSec=12.301832484359094, CurrSamplesPerSec=12.420096949094322, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:37:06,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=40, lr=[8.034171226111403e-05], mom=[(0.9, 0.95)] [2023-04-19 18:37:06,497] [INFO] [timer.py:199:stop] epoch=0/micro_step=4260/global_step=2130, RunningAvgSamplesPerSec=12.301499367685778, CurrSamplesPerSec=10.856264810636182, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:37:32,691] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=2140, skipped=40, lr=[7.703508549172528e-05], mom=[(0.9, 0.95)] [2023-04-19 18:37:32,691] [INFO] [timer.py:199:stop] epoch=0/micro_step=4280/global_step=2140, RunningAvgSamplesPerSec=12.30115622408466, CurrSamplesPerSec=12.401573479735115, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:37:35,215] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:37:37,759] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:37:58,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=42, lr=[7.443569401286737e-05], mom=[(0.9, 0.95)] [2023-04-19 18:37:58,792] [INFO] [timer.py:199:stop] epoch=0/micro_step=4300/global_step=2150, RunningAvgSamplesPerSec=12.30102199734571, CurrSamplesPerSec=12.400159610180129, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:38:24,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=42, lr=[7.124425376007727e-05], mom=[(0.9, 0.95)] [2023-04-19 18:38:24,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=4320/global_step=2160, RunningAvgSamplesPerSec=12.300768461536611, CurrSamplesPerSec=12.453067368968957, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:38:51,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=42, lr=[6.811748355178887e-05], mom=[(0.9, 0.95)] [2023-04-19 18:38:51,245] [INFO] [timer.py:199:stop] epoch=0/micro_step=4340/global_step=2170, RunningAvgSamplesPerSec=12.300209402649584, CurrSamplesPerSec=12.43781132701424, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:39:17,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=42, lr=[6.505585500469818e-05], mom=[(0.9, 0.95)] [2023-04-19 18:39:17,071] [INFO] [timer.py:199:stop] epoch=0/micro_step=4360/global_step=2180, 
RunningAvgSamplesPerSec=12.300679290112301, CurrSamplesPerSec=12.365838971239125, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:39:43,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=42, lr=[6.205982991006093e-05], mom=[(0.9, 0.95)] [2023-04-19 18:39:43,643] [INFO] [timer.py:199:stop] epoch=0/micro_step=4380/global_step=2190, RunningAvgSamplesPerSec=12.299530650271372, CurrSamplesPerSec=12.405328558682902, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:40:09,493] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=42, lr=[5.912986016403909e-05], mom=[(0.9, 0.95)] [2023-04-19 18:40:09,494] [INFO] [timer.py:199:stop] epoch=0/micro_step=4400/global_step=2200, RunningAvgSamplesPerSec=12.299945811894498, CurrSamplesPerSec=12.441138616114486, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:40:36,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=42, lr=[5.6266387699540786e-05], mom=[(0.9, 0.95)] [2023-04-19 18:40:36,097] [INFO] [timer.py:199:stop] epoch=0/micro_step=4420/global_step=2210, RunningAvgSamplesPerSec=12.298745310472013, CurrSamplesPerSec=12.415833295329127, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:41:02,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=42, lr=[5.346984441956315e-05], mom=[(0.9, 0.95)] [2023-04-19 18:41:02,293] [INFO] [timer.py:199:stop] epoch=0/micro_step=4440/global_step=2220, RunningAvgSamplesPerSec=12.298422423831754, CurrSamplesPerSec=12.331312379976227, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:41:28,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=42, lr=[5.074065213204676e-05], mom=[(0.9, 0.95)] [2023-04-19 18:41:28,490] [INFO] [timer.py:199:stop] epoch=0/micro_step=4460/global_step=2230, RunningAvgSamplesPerSec=12.298102818031259, CurrSamplesPerSec=12.428839061774376, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:41:54,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, 
skipped=42, lr=[4.8079222486253736e-05], mom=[(0.9, 0.95)] [2023-04-19 18:41:54,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=4480/global_step=2240, RunningAvgSamplesPerSec=12.297953704777502, CurrSamplesPerSec=12.434558392459168, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:42:02,610] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:42:05,144] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:42:20,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=44, lr=[4.599913797658045e-05], mom=[(0.9, 0.95)] [2023-04-19 18:42:20,546] [INFO] [timer.py:199:stop] epoch=0/micro_step=4500/global_step=2250, RunningAvgSamplesPerSec=12.298179937217506, CurrSamplesPerSec=12.529763874087092, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:42:46,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=44, lr=[4.346068577864587e-05], mom=[(0.9, 0.95)] [2023-04-19 18:42:46,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=4520/global_step=2260, RunningAvgSamplesPerSec=12.29764823685467, CurrSamplesPerSec=12.45570692530282, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:43:12,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=44, lr=[4.099109427360304e-05], mom=[(0.9, 0.95)] [2023-04-19 18:43:12,672] [INFO] [timer.py:199:stop] epoch=0/micro_step=4540/global_step=2270, RunningAvgSamplesPerSec=12.298109115305559, CurrSamplesPerSec=12.437736408563667, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:43:38,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=44, lr=[3.859073595463469e-05], mom=[(0.9, 0.95)] [2023-04-19 18:43:38,690] [INFO] [timer.py:199:stop] epoch=0/micro_step=4560/global_step=2280, RunningAvgSamplesPerSec=12.298170178178243, 
CurrSamplesPerSec=12.75117692923243, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:44:03,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=44, lr=[3.6259972872350666e-05], mom=[(0.9, 0.95)] [2023-04-19 18:44:03,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=4580/global_step=2290, RunningAvgSamplesPerSec=12.300246104187035, CurrSamplesPerSec=12.824455816125798, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:44:29,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=44, lr=[3.399915658017838e-05], mom=[(0.9, 0.95)] [2023-04-19 18:44:29,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=4600/global_step=2300, RunningAvgSamplesPerSec=12.301030032165718, CurrSamplesPerSec=12.871611794362979, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:44:54,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=44, lr=[3.1808628081338496e-05], mom=[(0.9, 0.95)] [2023-04-19 18:44:54,682] [INFO] [timer.py:199:stop] epoch=0/micro_step=4620/global_step=2310, RunningAvgSamplesPerSec=12.302568709380266, CurrSamplesPerSec=12.835298490770182, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:45:19,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=44, lr=[2.9688717777409667e-05], mom=[(0.9, 0.95)] [2023-04-19 18:45:19,976] [INFO] [timer.py:199:stop] epoch=0/micro_step=4640/global_step=2320, RunningAvgSamplesPerSec=12.304085416510999, CurrSamplesPerSec=12.840191724087525, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:45:45,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=44, lr=[2.7639745418494234e-05], mom=[(0.9, 0.95)] [2023-04-19 18:45:45,627] [INFO] [timer.py:199:stop] epoch=0/micro_step=4660/global_step=2330, RunningAvgSamplesPerSec=12.304862619859534, CurrSamplesPerSec=11.164547013836554, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:46:10,564] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=44, lr=[2.5662020054989298e-05], 
mom=[(0.9, 0.95)] [2023-04-19 18:46:10,564] [INFO] [timer.py:199:stop] epoch=0/micro_step=4680/global_step=2340, RunningAvgSamplesPerSec=12.307078782682758, CurrSamplesPerSec=12.811729488667604, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:46:23,358] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:46:25,838] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:46:36,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=46, lr=[2.413133842345444e-05], mom=[(0.9, 0.95)] [2023-04-19 18:46:36,126] [INFO] [timer.py:199:stop] epoch=0/micro_step=4700/global_step=2350, RunningAvgSamplesPerSec=12.308017841166846, CurrSamplesPerSec=12.765987834726452, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:47:01,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=46, lr=[2.2282602127670638e-05], mom=[(0.9, 0.95)] [2023-04-19 18:47:01,159] [INFO] [timer.py:199:stop] epoch=0/micro_step=4720/global_step=2360, RunningAvgSamplesPerSec=12.310010809005004, CurrSamplesPerSec=12.779279202937284, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:47:27,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=46, lr=[2.0505920855406757e-05], mom=[(0.9, 0.95)] [2023-04-19 18:47:27,032] [INFO] [timer.py:199:stop] epoch=0/micro_step=4740/global_step=2370, RunningAvgSamplesPerSec=12.310307228959495, CurrSamplesPerSec=12.691875719261018, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:47:52,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=46, lr=[1.8801562586877375e-05], mom=[(0.9, 0.95)] [2023-04-19 18:47:52,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=4760/global_step=2380, RunningAvgSamplesPerSec=12.311004314974385, CurrSamplesPerSec=11.0915473904342, MemAllocated=1.95GB, 
MaxMemAllocated=13.62GB [2023-04-19 18:48:18,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=46, lr=[1.7169784393681164e-05], mom=[(0.9, 0.95)] [2023-04-19 18:48:18,610] [INFO] [timer.py:199:stop] epoch=0/micro_step=4780/global_step=2390, RunningAvgSamplesPerSec=12.311227600222733, CurrSamplesPerSec=12.524098052787979, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:48:44,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=46, lr=[1.561083240002592e-05], mom=[(0.9, 0.95)] [2023-04-19 18:48:44,678] [INFO] [timer.py:199:stop] epoch=0/micro_step=4800/global_step=2400, RunningAvgSamplesPerSec=12.311130099979922, CurrSamplesPerSec=12.407947914210096, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:49:10,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=46, lr=[1.4124941745605024e-05], mom=[(0.9, 0.95)] [2023-04-19 18:49:10,952] [INFO] [timer.py:199:stop] epoch=0/micro_step=4820/global_step=2410, RunningAvgSamplesPerSec=12.310627790060659, CurrSamplesPerSec=12.370306632484153, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:49:37,487] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=46, lr=[1.2712336550131598e-05], mom=[(0.9, 0.95)] [2023-04-19 18:49:37,487] [INFO] [timer.py:199:stop] epoch=0/micro_step=4840/global_step=2420, RunningAvgSamplesPerSec=12.309618788598172, CurrSamplesPerSec=10.852950064021522, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:50:03,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=46, lr=[1.1373229879533375e-05], mom=[(0.9, 0.95)] [2023-04-19 18:50:03,211] [INFO] [timer.py:199:stop] epoch=0/micro_step=4860/global_step=2430, RunningAvgSamplesPerSec=12.310200725412857, CurrSamplesPerSec=12.48376864028326, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:50:29,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=46, lr=[1.010782371381569e-05], mom=[(0.9, 0.95)] [2023-04-19 18:50:29,508] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=4880/global_step=2440, RunningAvgSamplesPerSec=12.309663735759242, CurrSamplesPerSec=12.454816932342066, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:50:47,517] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-19 18:50:50,067] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 [2023-04-19 18:50:55,269] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=48, lr=[9.14869177936145e-06], mom=[(0.9, 0.95)] [2023-04-19 18:50:55,269] [INFO] [timer.py:199:stop] epoch=0/micro_step=4900/global_step=2450, RunningAvgSamplesPerSec=12.31016926218093, CurrSamplesPerSec=12.382171843136206, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:51:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=48, lr=[8.016420010113156e-06], mom=[(0.9, 0.95)] [2023-04-19 18:51:21,647] [INFO] [timer.py:199:stop] epoch=0/micro_step=4920/global_step=2460, RunningAvgSamplesPerSec=12.309482308619254, CurrSamplesPerSec=12.48385456463192, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:51:47,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=48, lr=[6.958355059761279e-06], mom=[(0.9, 0.95)] [2023-04-19 18:51:47,717] [INFO] [timer.py:199:stop] epoch=0/micro_step=4940/global_step=2470, RunningAvgSamplesPerSec=12.309390904770943, CurrSamplesPerSec=12.388855674772541, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:52:13,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=48, lr=[5.974656518254129e-06], mom=[(0.9, 0.95)] [2023-04-19 18:52:13,820] [INFO] [timer.py:199:stop] epoch=0/micro_step=4960/global_step=2480, RunningAvgSamplesPerSec=12.309235890996996, CurrSamplesPerSec=12.403854224243707, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:52:40,021] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=2490, skipped=48, lr=[5.06547275871333e-06], mom=[(0.9, 0.95)] [2023-04-19 18:52:40,021] [INFO] [timer.py:199:stop] epoch=0/micro_step=4980/global_step=2490, RunningAvgSamplesPerSec=12.308896336184107, CurrSamplesPerSec=12.31674295094959, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:53:06,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=48, lr=[4.23094091505416e-06], mom=[(0.9, 0.95)] [2023-04-19 18:53:06,297] [INFO] [timer.py:199:stop] epoch=0/micro_step=5000/global_step=2500, RunningAvgSamplesPerSec=12.30841818692135, CurrSamplesPerSec=12.305871454185846, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:53:32,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=48, lr=[3.471186861301545e-06], mom=[(0.9, 0.95)] [2023-04-19 18:53:32,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=5020/global_step=2510, RunningAvgSamplesPerSec=12.307428673506953, CurrSamplesPerSec=12.437062183120354, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:53:58,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=48, lr=[2.7863251926040224e-06], mom=[(0.9, 0.95)] [2023-04-19 18:53:58,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=5040/global_step=2520, RunningAvgSamplesPerSec=12.308026366186626, CurrSamplesPerSec=12.448089466013181, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:54:24,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=48, lr=[2.1764592079493996e-06], mom=[(0.9, 0.95)] [2023-04-19 18:54:24,848] [INFO] [timer.py:199:stop] epoch=0/micro_step=5060/global_step=2530, RunningAvgSamplesPerSec=12.307524238014608, CurrSamplesPerSec=12.462064247476162, MemAllocated=1.95GB, MaxMemAllocated=13.62GB [2023-04-19 18:54:50,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=48, lr=[1.6416808945838302e-06], mom=[(0.9, 0.95)] [2023-04-19 18:54:50,593] [INFO] [timer.py:199:stop] epoch=0/micro_step=5080/global_step=2540, 
RunningAvgSamplesPerSec=12.308049471202695, CurrSamplesPerSec=12.400358953137731, MemAllocated=1.95GB, MaxMemAllocated=13.62GB
[2023-04-19 18:55:14,484] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-19 18:55:17,038] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536
[2023-04-19 18:55:17,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=50, lr=[1.2679759526949552e-06], mom=[(0.9, 0.95)]
[2023-04-19 18:55:17,039] [INFO] [timer.py:199:stop] epoch=0/micro_step=5100/global_step=2550, RunningAvgSamplesPerSec=12.307268386762189, CurrSamplesPerSec=12.540417228668503, MemAllocated=1.95GB, MaxMemAllocated=13.62GB
***** Evaluating perplexity, Epoch 1/1 *****
Invalidate trace cache @ step 0: expected module 0, but got module 16
ppl: 1.6646381616592407
saving the final model ...
[2023-04-19 19:03:02,394] [INFO] [launch.py:460:main] Process 10813 exits successfully.
[2023-04-19 19:03:02,395] [INFO] [launch.py:460:main] Process 10814 exits successfully.
[2023-04-19 19:03:02,395] [INFO] [launch.py:460:main] Process 10815 exits successfully.
[2023-04-19 19:03:14,408] [INFO] [launch.py:460:main] Process 10812 exits successfully.