2023-04-23 15:28:46.796128: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2023-04-23 15:28:48,338] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-23 15:28:48,365] [INFO] [runner.py:540:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 16 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir ./output
2023-04-23 15:28:50.591037: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2023-04-23 15:28:51,955] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-23 15:28:51,955] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-23 15:28:51,955] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1]})
[2023-04-23 15:28:51,955] [INFO] [launch.py:247:main] dist_world_size=2
[2023-04-23 15:28:51,955] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
2023-04-23 15:28:53.908696: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-04-23 15:28:54.024671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2023-04-23 15:28:56,583] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Found cached dataset parquet (/root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
0%| | 0/2 [00:00
[2023-04-23 15:31:23,437] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-04-23 15:31:23,438] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000
[2023-04-23 15:31:23,438] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000
[2023-04-23 15:31:23,438] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False
[2023-04-23 15:31:23,438] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.064605951309204 seconds
Loading extension module utils...
Time to load utils op: 1.111020803451538 seconds
Rank: 1 partition count [2, 2] and sizes[(657607680, False), (271360, False)]
Rank: 0 partition count [2, 2] and sizes[(657607680, False), (271360, False)]
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0008490085601806641 seconds
[2023-04-23 15:31:31,483] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-04-23 15:31:31,484] [INFO] [utils.py:786:see_memory_usage] MA 4.9 GB Max_MA 4.9 GB CA 4.9 GB Max_CA 5 GB
[2023-04-23 15:31:31,485] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 154.26 GB, percent = 24.5%
[2023-04-23 15:31:31,816] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-04-23 15:31:31,817] [INFO] [utils.py:786:see_memory_usage] MA 9.8 GB Max_MA 12.26 GB CA 12.26 GB Max_CA 12 GB
[2023-04-23 15:31:31,817] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 154.3 GB, percent = 24.5%
[2023-04-23 15:31:31,817] [INFO] [stage_1_and_2.py:489:__init__] optimizer state initialized
[2023-04-23 15:31:32,150] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-04-23 15:31:32,150] [INFO] [utils.py:786:see_memory_usage] MA 9.8 GB Max_MA 9.8 GB CA 12.26 GB Max_CA 12 GB
[2023-04-23 15:31:32,151] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 154.24 GB, percent = 24.5%
[2023-04-23 15:31:32,156] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-04-23 15:31:32,156] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-23 15:31:32,157] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2023-04-23 15:31:32,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:31:32,157] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] amp_params ................... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] comms_config .................
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] dump_state ................... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] eigenvalue_enabled ........... False
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0
[2023-04-23 15:31:32,158] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] global_rank .................. 0
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] gradient_accumulation_steps .. 4
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] gradient_clipping ............ 1.0
[2023-04-23 15:31:32,159] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] pld_params ................... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] scheduler_params ............. None
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] sparse_attention ............. None
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] steps_per_print .............. 10
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] train_batch_size ............. 64
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 8
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-04-23 15:31:32,161] [INFO] [config.py:957:print] world_size ................... 2
[2023-04-23 15:31:32,162] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-04-23 15:31:32,162] [INFO] [config.py:957:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-04-23 15:31:32,162] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-04-23 15:31:32,162] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-04-23 15:31:32,162] [INFO] [config.py:957:print] zero_optimization_stage ...... 2
[2023-04-23 15:31:32,162] [INFO] [config.py:943:print_user_config] json = { "train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "steps_per_print": 10, "zero_optimization": { "stage": 2, "offload_param": { "device": "none" }, "offload_optimizer": { "device": "none" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07, "memory_efficient_linear": false }, "fp16": { "enabled": true, "loss_scale_window": 100 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "hybrid_engine": { "enabled": false, "max_out_tokens": 512, "inference_tp_size": 1, "release_inference_cache": false, "pin_parameters": true, "tp_gather_partition_size": 8 } }
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00041031837463378906 seconds
***** Running training *****
***** Evaluating perplexity, Epoch 0/16 *****
ppl: 4894.6640625
Beginning of Epoch 1/16, Total Micro Batches 3680
[2023-04-23 15:32:08,330] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-23 15:32:10,759] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-23 15:32:13,190] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-04-23 15:32:15,634] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2023-04-23 15:32:18,075] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2023-04-23 15:32:31,438] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=5, lr=[9.649997252792808e-06, 9.649997252792808e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:32:31,675] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=10, RunningAvgSamplesPerSec=24.537163562504265, CurrSamplesPerSec=23.61150794628124, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:32:58,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=5, lr=[9.649975275154037e-06, 9.649975275154037e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:32:58,582] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=20, RunningAvgSamplesPerSec=24.140219724123206, CurrSamplesPerSec=23.89741280577237, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:33:25,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=5, lr=[9.649931319976603e-06, 9.649931319976603e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:33:25,464] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=30, RunningAvgSamplesPerSec=24.038584635993875, CurrSamplesPerSec=23.79104929485341, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:33:52,049] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=5, lr=[9.649865387460722e-06, 9.649865387460722e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:33:52,296] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=40, RunningAvgSamplesPerSec=24.000618919220944, CurrSamplesPerSec=23.84538435121495, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:34:18,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=5, lr=[9.649777477906706e-06, 9.649777477906706e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:34:19,092] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=50, RunningAvgSamplesPerSec=23.98535112146434, CurrSamplesPerSec=23.941017979612866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:34:45,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=5, lr=[9.649667591714989e-06, 9.649667591714989e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:34:45,916] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=60, RunningAvgSamplesPerSec=23.971382019328495, CurrSamplesPerSec=23.887985485427137, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:35:12,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=5, lr=[9.649535729386089e-06, 9.649535729386089e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:35:12,698] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=70, RunningAvgSamplesPerSec=23.96687318369912, CurrSamplesPerSec=23.983072545912716, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:35:39,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=5, lr=[9.64938189152064e-06, 9.64938189152064e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:35:39,506] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=80, RunningAvgSamplesPerSec=23.960484073066528, CurrSamplesPerSec=23.881898790778248, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:36:06,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=5, lr=[9.64920607881936e-06, 9.64920607881936e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:36:06,452] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=90, RunningAvgSamplesPerSec=23.941773827558226, CurrSamplesPerSec=23.762716530620068, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:36:33,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=5, lr=[9.64900829208307e-06, 9.64900829208307e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:36:33,420] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=100, RunningAvgSamplesPerSec=23.924789644344823, CurrSamplesPerSec=23.840672267892554, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:37:00,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=5, lr=[9.64878853221268e-06, 9.64878853221268e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:37:00,404] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=110, RunningAvgSamplesPerSec=23.90959065405221, CurrSamplesPerSec=23.75915365566045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:37:27,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=5, lr=[9.648546800209186e-06, 9.648546800209186e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:37:27,454] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=120, RunningAvgSamplesPerSec=23.89223380986927, CurrSamplesPerSec=23.70932253023489, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:37:54,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=5, lr=[9.648283097173667e-06, 9.648283097173667e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:37:54,545] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=130, RunningAvgSamplesPerSec=23.874718249167362, CurrSamplesPerSec=23.692443505972594, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:38:21,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=5, lr=[9.647997424307275e-06, 9.647997424307275e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:38:21,635] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=140, RunningAvgSamplesPerSec=23.859893357261974, CurrSamplesPerSec=23.681135923322778, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:38:48,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=5, lr=[9.64768978291124e-06, 9.64768978291124e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:38:48,690] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=150, RunningAvgSamplesPerSec=23.849261017287798, CurrSamplesPerSec=23.69347656637194, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:39:15,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=5, lr=[9.647360174386853e-06, 9.647360174386853e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:39:15,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=160, RunningAvgSamplesPerSec=23.84037642916495, CurrSamplesPerSec=23.65740216281528, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:39:42,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=5, lr=[9.647008600235464e-06, 9.647008600235464e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:39:42,834] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=170, RunningAvgSamplesPerSec=23.83016240041546, CurrSamplesPerSec=23.62063720484043, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:40:09,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=5, lr=[9.64663506205848e-06, 9.64663506205848e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:40:09,915] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=180, RunningAvgSamplesPerSec=23.82165348145364, CurrSamplesPerSec=23.685772607439702, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:40:36,783] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=5, lr=[9.646239561557348e-06, 9.646239561557348e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:40:37,023] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=190, RunningAvgSamplesPerSec=23.813832417397748, CurrSamplesPerSec=23.665915963769834, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:41:03,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=5, lr=[9.645822100533555e-06, 9.645822100533555e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:41:04,134] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=200, RunningAvgSamplesPerSec=23.806443436114083, CurrSamplesPerSec=23.596257023881815, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:41:31,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=5, lr=[9.645382680888615e-06, 9.645382680888615e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:41:31,253] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=210, RunningAvgSamplesPerSec=23.799399380595826, CurrSamplesPerSec=23.65682047796681, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:41:58,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=5, lr=[9.644921304624067e-06, 9.644921304624067e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:41:58,323] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=220, RunningAvgSamplesPerSec=23.795360992103124, CurrSamplesPerSec=23.644501318420826, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:42:25,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=5, lr=[9.644437973841459e-06, 9.644437973841459e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:42:25,401] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=230, RunningAvgSamplesPerSec=23.7908300244481, CurrSamplesPerSec=23.693654328763756, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:42:52,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=5, lr=[9.643932690742336e-06, 9.643932690742336e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:42:52,464] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=240, RunningAvgSamplesPerSec=23.787071653036936, CurrSamplesPerSec=23.639010593263436, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:43:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=5, lr=[9.64340545762824e-06, 9.64340545762824e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:43:19,538] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=250, RunningAvgSamplesPerSec=23.783368040065476, CurrSamplesPerSec=23.715396990044294, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:43:46,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=5, lr=[9.642856276900698e-06, 9.642856276900698e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:43:46,591] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=260, RunningAvgSamplesPerSec=23.78047570190329, CurrSamplesPerSec=23.79948022204358, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:44:13,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=5, lr=[9.642285151061199e-06, 9.642285151061199e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:44:13,624] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=270, RunningAvgSamplesPerSec=23.778336918667634, CurrSamplesPerSec=23.660319358019198, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:44:40,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=5, lr=[9.641692082711195e-06, 9.641692082711195e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:44:40,721] [INFO] [timer.py:199:stop] epoch=0/micro_step=1120/global_step=280, RunningAvgSamplesPerSec=23.774549562246953, CurrSamplesPerSec=23.655509185875932, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:45:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=5, lr=[9.641077074552085e-06, 9.641077074552085e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:45:07,834] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=290, RunningAvgSamplesPerSec=23.770425193694305, CurrSamplesPerSec=23.694457429755744, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:45:34,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=5, lr=[9.640440129385204e-06, 9.640440129385204e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:45:34,963] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=300, RunningAvgSamplesPerSec=23.76666281955425, CurrSamplesPerSec=23.69568100857531, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:46:01,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=5, lr=[9.639781250111804e-06, 9.639781250111804e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:46:02,020] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=310, RunningAvgSamplesPerSec=23.764557324962148, CurrSamplesPerSec=23.68184624836558, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:46:30,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=5, lr=[9.639100439733056e-06, 9.639100439733056e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:46:30,288] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=320, RunningAvgSamplesPerSec=23.730180980503224, CurrSamplesPerSec=23.686230314634948, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:46:57,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=5, lr=[9.638397701350013e-06, 9.638397701350013e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:46:57,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=330, RunningAvgSamplesPerSec=23.7290535863951, CurrSamplesPerSec=23.691922827488746, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:47:24,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=5, lr=[9.637673038163619e-06, 9.637673038163619e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:47:24,881] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=340, RunningAvgSamplesPerSec=23.718096878407668, CurrSamplesPerSec=20.64149157678474, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:47:51,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=5, lr=[9.63692645347468e-06, 9.63692645347468e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:47:51,966] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=350, RunningAvgSamplesPerSec=23.716886128937702, CurrSamplesPerSec=23.667172069795534, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:48:18,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=5, lr=[9.636157950683857e-06, 9.636157950683857e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:48:19,034] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=360, RunningAvgSamplesPerSec=23.716221418563286, CurrSamplesPerSec=23.57732873759115, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:48:46,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=5, lr=[9.635367533291643e-06, 9.635367533291643e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:48:46,381] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=370, RunningAvgSamplesPerSec=23.710402204229286, CurrSamplesPerSec=23.725619577591267, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:49:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=5, lr=[9.634555204898352e-06, 9.634555204898352e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:49:13,456] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=380, RunningAvgSamplesPerSec=23.709707364246487, CurrSamplesPerSec=23.697923518609727, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:49:40,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=5, lr=[9.633720969204103e-06, 9.633720969204103e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:49:40,496] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=390, RunningAvgSamplesPerSec=23.70982290770526, CurrSamplesPerSec=23.76339178851321, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:50:07,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=5, lr=[9.632864830008802e-06, 9.632864830008802e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:50:07,542] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=400, RunningAvgSamplesPerSec=23.70987792298654, CurrSamplesPerSec=23.774615943202825, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:50:34,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=5, lr=[9.63198679121212e-06, 9.63198679121212e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:50:34,595] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=410, RunningAvgSamplesPerSec=23.70981551225716, CurrSamplesPerSec=23.689955329505022, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:51:01,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=5, lr=[9.631086856813484e-06, 9.631086856813484e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:51:01,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=420, RunningAvgSamplesPerSec=23.708604008546438, CurrSamplesPerSec=23.664570283835072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:51:28,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=5, lr=[9.630165030912056e-06, 9.630165030912056e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:51:28,775] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=430, RunningAvgSamplesPerSec=23.708113598642846, CurrSamplesPerSec=23.663888113860807, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:51:55,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=5, lr=[9.629221317706709e-06, 9.629221317706709e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:51:55,863] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=440, RunningAvgSamplesPerSec=23.70733422496478, CurrSamplesPerSec=23.59932928382243, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:52:22,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=5, lr=[9.628255721496013e-06, 9.628255721496013e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:52:23,030] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=450, RunningAvgSamplesPerSec=23.70510665043777, CurrSamplesPerSec=23.661408016878827, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 15:52:49,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=5, lr=[9.627268246678213e-06, 9.627268246678213e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 15:52:50,104] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=460, RunningAvgSamplesPerSec=23.704798412716645, CurrSamplesPerSec=23.734365129026187,
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:53:16,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=5, lr=[9.62625889775121e-06, 9.62625889775121e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:53:17,167] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=470, RunningAvgSamplesPerSec=23.7046032054626, CurrSamplesPerSec=23.65625966912225, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:53:43,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=5, lr=[9.625227679312546e-06, 9.625227679312546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:53:44,216] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=480, RunningAvgSamplesPerSec=23.704693696987686, CurrSamplesPerSec=23.692301310380138, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:54:10,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=5, lr=[9.624174596059368e-06, 9.624174596059368e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:54:11,225] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=490, RunningAvgSamplesPerSec=23.705422764289146, CurrSamplesPerSec=23.682322607852157, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:54:38,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=5, lr=[9.623099652788424e-06, 9.623099652788424e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:54:38,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=500, RunningAvgSamplesPerSec=23.704826309835486, CurrSamplesPerSec=23.622921662077204, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:55:05,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=5, lr=[9.622002854396033e-06, 9.622002854396033e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:55:05,372] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=510, RunningAvgSamplesPerSec=23.70469832155814, CurrSamplesPerSec=23.67559893823143, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 15:55:32,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=5, lr=[9.620884205878055e-06, 9.620884205878055e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:55:32,463] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=520, RunningAvgSamplesPerSec=23.70403699167987, CurrSamplesPerSec=23.690139311214573, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:55:59,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=5, lr=[9.619743712329887e-06, 9.619743712329887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:55:59,523] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=530, RunningAvgSamplesPerSec=23.703914334520146, CurrSamplesPerSec=23.686986928786855, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:56:26,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=5, lr=[9.618581378946423e-06, 9.618581378946423e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:56:26,579] [INFO] [timer.py:199:stop] epoch=0/micro_step=2160/global_step=540, RunningAvgSamplesPerSec=23.703847932472225, CurrSamplesPerSec=23.713955595211015, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:56:53,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=5, lr=[9.617397211022037e-06, 9.617397211022037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:56:53,655] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=550, RunningAvgSamplesPerSec=23.703598095197314, CurrSamplesPerSec=23.72779854471635, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:57:20,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=5, lr=[9.616191213950558e-06, 9.616191213950558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:57:20,700] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=560, RunningAvgSamplesPerSec=23.703694507516865, CurrSamplesPerSec=23.788122964452263, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:57:47,554] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=570, skipped=5, lr=[9.61496339322525e-06, 9.61496339322525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:57:47,793] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=570, RunningAvgSamplesPerSec=23.703096633386625, CurrSamplesPerSec=23.640436645657285, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:58:14,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=5, lr=[9.613713754438776e-06, 9.613713754438776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:58:14,859] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=580, RunningAvgSamplesPerSec=23.702931462385664, CurrSamplesPerSec=23.64785280926958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:58:41,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=5, lr=[9.612442303283185e-06, 9.612442303283185e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:58:41,945] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=590, RunningAvgSamplesPerSec=23.702510986559872, CurrSamplesPerSec=23.682872116501517, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:59:08,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=5, lr=[9.611149045549879e-06, 9.611149045549879e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:59:09,095] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=600, RunningAvgSamplesPerSec=23.70116987496296, CurrSamplesPerSec=23.597516118077838, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 15:59:25,087] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 15:59:27,506] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 15:59:35,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=7, lr=[9.610098742582508e-06, 9.610098742582508e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 15:59:35,680] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=610, RunningAvgSamplesPerSec=23.707981335432976, CurrSamplesPerSec=23.536264953177, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:00:02,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=7, lr=[9.60876624792068e-06, 9.60876624792068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:00:02,802] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=620, RunningAvgSamplesPerSec=23.706941643434604, CurrSamplesPerSec=23.613958897433598, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:00:29,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=7, lr=[9.607411963425395e-06, 9.607411963425395e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:00:29,953] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=630, RunningAvgSamplesPerSec=23.705545270358723, CurrSamplesPerSec=23.68881805146025, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:00:56,807] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=7, lr=[9.606035895265358e-06, 9.606035895265358e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:00:57,048] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=640, RunningAvgSamplesPerSec=23.705044566957433, CurrSamplesPerSec=23.668369876472543, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:01:23,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=7, lr=[9.604638049708498e-06, 9.604638049708498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:01:24,233] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=650, RunningAvgSamplesPerSec=23.70322536095331, CurrSamplesPerSec=23.69856999256825, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:01:51,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=7, lr=[9.603218433121933e-06, 9.603218433121933e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:01:51,411] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=660, RunningAvgSamplesPerSec=23.70157178187665, CurrSamplesPerSec=23.5802821442046, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:02:18,339] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=7, lr=[9.601777051971952e-06, 9.601777051971952e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:02:18,581] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=670, RunningAvgSamplesPerSec=23.700140331526978, CurrSamplesPerSec=23.637919831365192, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:02:45,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=7, lr=[9.600313912823979e-06, 9.600313912823979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:02:45,657] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=680, RunningAvgSamplesPerSec=23.70001933488215, CurrSamplesPerSec=23.61401290717136, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:03:12,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=7, lr=[9.598829022342547e-06, 9.598829022342547e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:03:12,782] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=690, RunningAvgSamplesPerSec=23.69923619334412, CurrSamplesPerSec=23.689662637220174, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:03:39,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=7, lr=[9.597322387291262e-06, 9.597322387291262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:03:39,891] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=700, RunningAvgSamplesPerSec=23.69883266922105, CurrSamplesPerSec=23.603500207558508, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
16:04:01,272] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:04:03,693] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:04:06,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=9, lr=[9.596101427768314e-06, 9.596101427768314e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:04:06,398] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=710, RunningAvgSamplesPerSec=23.705898388554274, CurrSamplesPerSec=23.73127019963637, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:04:33,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=9, lr=[9.594555669851717e-06, 9.594555669851717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:04:33,513] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=720, RunningAvgSamplesPerSec=23.705233431436206, CurrSamplesPerSec=23.699371334487715, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:05:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=9, lr=[9.592988186830188e-06, 9.592988186830188e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:05:00,624] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=730, RunningAvgSamplesPerSec=23.704639682175387, CurrSamplesPerSec=23.55074207500254, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:05:27,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=9, lr=[9.591398985843542e-06, 9.591398985843542e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:05:27,722] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=740, RunningAvgSamplesPerSec=23.70419501784094, CurrSamplesPerSec=23.687791668152105, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:05:54,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=9, 
lr=[9.589788074130512e-06, 9.589788074130512e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:05:54,797] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=750, RunningAvgSamplesPerSec=23.70392542290114, CurrSamplesPerSec=23.60546167380365, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:06:21,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=9, lr=[9.588155459028732e-06, 9.588155459028732e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:06:21,874] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=760, RunningAvgSamplesPerSec=23.703638255188142, CurrSamplesPerSec=23.688420866106885, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:06:48,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=9, lr=[9.586501147974682e-06, 9.586501147974682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:06:49,000] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=770, RunningAvgSamplesPerSec=23.70281357133256, CurrSamplesPerSec=23.57862516299295, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:07:15,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=9, lr=[9.584825148503677e-06, 9.584825148503677e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:07:16,114] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=780, RunningAvgSamplesPerSec=23.702145307530554, CurrSamplesPerSec=23.5478848871057, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:07:42,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=9, lr=[9.583127468249814e-06, 9.583127468249814e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:07:43,211] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=790, RunningAvgSamplesPerSec=23.701656344925784, CurrSamplesPerSec=23.677981759248077, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:08:10,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=9, lr=[9.581408114945948e-06, 
9.581408114945948e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:08:10,259] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=800, RunningAvgSamplesPerSec=23.701699522423258, CurrSamplesPerSec=23.67552376500472, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:08:36,912] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:08:36,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=10, lr=[9.579842172985476e-06, 9.579842172985476e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:08:36,914] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=810, RunningAvgSamplesPerSec=23.706051834300478, CurrSamplesPerSec=26.79023570996075, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:08:39,312] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:09:03,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=11, lr=[9.578258687974373e-06, 9.578258687974373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:09:03,474] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=820, RunningAvgSamplesPerSec=23.711250021098813, CurrSamplesPerSec=23.90407361909651, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:09:30,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=11, lr=[9.576478692109454e-06, 9.576478692109454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:09:30,326] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=830, RunningAvgSamplesPerSec=23.7132776058276, CurrSamplesPerSec=23.853809325180983, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:09:56,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=11, lr=[9.5746770534794e-06, 9.5746770534794e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:09:57,180] [INFO] 
[timer.py:199:stop] epoch=0/micro_step=3360/global_step=840, RunningAvgSamplesPerSec=23.71521549851655, CurrSamplesPerSec=23.831685296398508, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:10:23,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=11, lr=[9.572853780290592e-06, 9.572853780290592e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:10:24,015] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=850, RunningAvgSamplesPerSec=23.717303888735298, CurrSamplesPerSec=23.917610971209193, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:10:50,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=11, lr=[9.571008880847953e-06, 9.571008880847953e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:10:50,859] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=860, RunningAvgSamplesPerSec=23.71923008803062, CurrSamplesPerSec=23.845505089729958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:11:17,469] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=11, lr=[9.569142363554916e-06, 9.569142363554916e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:11:17,710] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=870, RunningAvgSamplesPerSec=23.721045513172168, CurrSamplesPerSec=23.94829279433596, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:11:45,239] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=11, lr=[9.56725423691338e-06, 9.56725423691338e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:11:45,480] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=880, RunningAvgSamplesPerSec=23.715280056781452, CurrSamplesPerSec=23.85002413118837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:12:12,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=11, lr=[9.565344509523676e-06, 9.565344509523676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:12:12,354] [INFO] [timer.py:199:stop] 
epoch=0/micro_step=3560/global_step=890, RunningAvgSamplesPerSec=23.71690731079688, CurrSamplesPerSec=23.84560252872132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:12:38,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=11, lr=[9.56341319008452e-06, 9.56341319008452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:12:39,222] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=900, RunningAvgSamplesPerSec=23.718572683121415, CurrSamplesPerSec=23.854171800397523, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:13:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=11, lr=[9.561460287392985e-06, 9.561460287392985e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:13:06,113] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=910, RunningAvgSamplesPerSec=23.719978591672493, CurrSamplesPerSec=23.973648188527303, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:13:11,201] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:13:13,595] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:13:32,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=13, lr=[9.559882431272438e-06, 9.559882431272438e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:13:32,455] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=920, RunningAvgSamplesPerSec=23.726641954189354, CurrSamplesPerSec=23.873337191338468, MemAllocated=11.44GB, MaxMemAllocated=31.66GB ***** Evaluating perplexity, Epoch 1/16 ***** ppl: 2.00248384475708 saving the final model ... 
Beginning of Epoch 2/16, Total Micro Batches 3680 [2023-04-23 16:14:36,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=13, lr=[9.55789070120902e-06, 9.55789070120902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:14:37,177] [INFO] [timer.py:199:stop] epoch=1/micro_step=40/global_step=930, RunningAvgSamplesPerSec=23.726613974125694, CurrSamplesPerSec=23.707846278280282, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:15:03,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=13, lr=[9.555877413047903e-06, 9.555877413047903e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:15:04,088] [INFO] [timer.py:199:stop] epoch=1/micro_step=80/global_step=940, RunningAvgSamplesPerSec=23.727643626569453, CurrSamplesPerSec=23.82439444861977, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:15:30,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=13, lr=[9.553842575959522e-06, 9.553842575959522e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:15:30,959] [INFO] [timer.py:199:stop] epoch=1/micro_step=120/global_step=950, RunningAvgSamplesPerSec=23.729024407871616, CurrSamplesPerSec=23.865384293957096, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:15:57,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=13, lr=[9.551786199212467e-06, 9.551786199212467e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:15:57,818] [INFO] [timer.py:199:stop] epoch=1/micro_step=160/global_step=960, RunningAvgSamplesPerSec=23.730471054662587, CurrSamplesPerSec=23.886136196348716, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:16:25,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=13, lr=[9.549708292173435e-06, 9.549708292173435e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:16:25,304] [INFO] [timer.py:199:stop] epoch=1/micro_step=200/global_step=970, RunningAvgSamplesPerSec=23.727707240146025, CurrSamplesPerSec=20.31804408190996, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:16:51,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=13, lr=[9.547608864307198e-06, 9.547608864307198e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:16:52,135] [INFO] [timer.py:199:stop] epoch=1/micro_step=240/global_step=980, RunningAvgSamplesPerSec=23.729509143886624, CurrSamplesPerSec=23.913586075935385, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:17:18,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=13, lr=[9.545487925176554e-06, 9.545487925176554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:17:18,982] [INFO] [timer.py:199:stop] epoch=1/micro_step=280/global_step=990, RunningAvgSamplesPerSec=23.731031093495215, CurrSamplesPerSec=23.92097212541132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:17:45,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=13, lr=[9.543345484442282e-06, 9.543345484442282e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:17:45,807] [INFO] [timer.py:199:stop] epoch=1/micro_step=320/global_step=1000, RunningAvgSamplesPerSec=23.732747860708802, CurrSamplesPerSec=23.95233151929987, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:18:12,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=13, lr=[9.541181551863098e-06, 9.541181551863098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:18:12,665] [INFO] [timer.py:199:stop] epoch=1/micro_step=360/global_step=1010, RunningAvgSamplesPerSec=23.734141314909518, CurrSamplesPerSec=23.843198560075237, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:18:23,150] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:18:25,540] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:18:38,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=15, lr=[9.539434938291769e-06, 9.539434938291769e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:18:38,962] [INFO] [timer.py:199:stop] epoch=1/micro_step=400/global_step=1020, RunningAvgSamplesPerSec=23.74038885350426, CurrSamplesPerSec=23.88389405172956, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:19:05,611] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=15, lr=[9.537232345296166e-06, 9.537232345296166e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:19:05,854] [INFO] [timer.py:199:stop] epoch=1/micro_step=440/global_step=1030, RunningAvgSamplesPerSec=23.74145963213532, CurrSamplesPerSec=23.859162783768987, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:19:32,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=15, lr=[9.53500828830072e-06, 9.53500828830072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:19:32,684] [INFO] [timer.py:199:stop] epoch=1/micro_step=480/global_step=1040, RunningAvgSamplesPerSec=23.74299188287979, CurrSamplesPerSec=23.936280834288564, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:19:59,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=15, lr=[9.532762777435901e-06, 9.532762777435901e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:19:59,550] [INFO] [timer.py:199:stop] epoch=1/micro_step=520/global_step=1050, RunningAvgSamplesPerSec=23.744184707642148, CurrSamplesPerSec=23.766278370124883, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:20:26,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=15, lr=[9.530495822929913e-06, 9.530495822929913e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:20:26,423] [INFO] [timer.py:199:stop] epoch=1/micro_step=560/global_step=1060, RunningAvgSamplesPerSec=23.74525344799614, CurrSamplesPerSec=23.81474786316597, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:20:53,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=15, lr=[9.528207435108627e-06, 9.528207435108627e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:20:53,253] [INFO] [timer.py:199:stop] epoch=1/micro_step=600/global_step=1070, RunningAvgSamplesPerSec=23.7466713652555, CurrSamplesPerSec=23.90328178807838, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:21:19,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=15, lr=[9.525897624395543e-06, 9.525897624395543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:21:20,174] [INFO] [timer.py:199:stop] epoch=1/micro_step=640/global_step=1080, RunningAvgSamplesPerSec=23.747337497740197, CurrSamplesPerSec=23.74771299212271, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:21:46,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=15, lr=[9.523566401311742e-06, 9.523566401311742e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:21:47,056] [INFO] [timer.py:199:stop] epoch=1/micro_step=680/global_step=1090, RunningAvgSamplesPerSec=23.748304716888, CurrSamplesPerSec=23.883962053717404, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:22:13,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=15, lr=[9.521213776475836e-06, 9.521213776475836e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:22:13,897] [INFO] [timer.py:199:stop] epoch=1/micro_step=720/global_step=1100, RunningAvgSamplesPerSec=23.74955290559033, CurrSamplesPerSec=23.843941938254655, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:22:40,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=15, lr=[9.518839760603926e-06, 9.518839760603926e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:22:40,715] [INFO] [timer.py:199:stop] epoch=1/micro_step=760/global_step=1110, RunningAvgSamplesPerSec=23.750959198110284, CurrSamplesPerSec=23.9302655052809, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
16:22:56,499] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:22:58,892] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:23:06,701] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=17, lr=[9.516925153623893e-06, 9.516925153623893e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:23:06,943] [INFO] [timer.py:199:stop] epoch=1/micro_step=800/global_step=1120, RunningAvgSamplesPerSec=23.75697817131835, CurrSamplesPerSec=23.955324045773256, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:23:33,538] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=17, lr=[9.514512661202712e-06, 9.514512661202712e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:23:33,781] [INFO] [timer.py:199:stop] epoch=1/micro_step=840/global_step=1130, RunningAvgSamplesPerSec=23.758143465298158, CurrSamplesPerSec=23.857949829376864, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:24:00,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=17, lr=[9.512078808268798e-06, 9.512078808268798e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:24:00,608] [INFO] [timer.py:199:stop] epoch=1/micro_step=880/global_step=1140, RunningAvgSamplesPerSec=23.759359625009015, CurrSamplesPerSec=23.90387565642379, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:24:27,199] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=17, lr=[9.509623605908231e-06, 9.509623605908231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:24:27,441] [INFO] [timer.py:199:stop] epoch=1/micro_step=920/global_step=1150, RunningAvgSamplesPerSec=23.760543468088276, CurrSamplesPerSec=23.820457934847408, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:24:54,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=17, 
lr=[9.507147065304347e-06, 9.507147065304347e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:24:54,271] [INFO] [timer.py:199:stop] epoch=1/micro_step=960/global_step=1160, RunningAvgSamplesPerSec=23.761715206158016, CurrSamplesPerSec=23.924268124606513, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:25:20,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=17, lr=[9.504649197737674e-06, 9.504649197737674e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:25:21,093] [INFO] [timer.py:199:stop] epoch=1/micro_step=1000/global_step=1170, RunningAvgSamplesPerSec=23.762908554709377, CurrSamplesPerSec=23.839092813205326, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:25:47,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=17, lr=[9.502130014585882e-06, 9.502130014585882e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:25:47,927] [INFO] [timer.py:199:stop] epoch=1/micro_step=1040/global_step=1180, RunningAvgSamplesPerSec=23.764015288531304, CurrSamplesPerSec=23.923997332345365, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:26:14,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=17, lr=[9.499589527323734e-06, 9.499589527323734e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:26:14,721] [INFO] [timer.py:199:stop] epoch=1/micro_step=1080/global_step=1190, RunningAvgSamplesPerSec=23.765373436226398, CurrSamplesPerSec=23.89155095762293, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:26:41,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=17, lr=[9.49702774752303e-06, 9.49702774752303e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:26:41,559] [INFO] [timer.py:199:stop] epoch=1/micro_step=1120/global_step=1200, RunningAvgSamplesPerSec=23.766420633082703, CurrSamplesPerSec=23.861434224119584, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:27:08,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=17, lr=[9.494444686852558e-06, 
9.494444686852558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:27:08,417] [INFO] [timer.py:199:stop] epoch=1/micro_step=1160/global_step=1210, RunningAvgSamplesPerSec=23.76729492145349, CurrSamplesPerSec=23.835259377545036, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:27:29,650] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:27:32,045] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:27:34,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=19, lr=[9.492362923993262e-06, 9.492362923993262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:27:34,746] [INFO] [timer.py:199:stop] epoch=1/micro_step=1200/global_step=1220, RunningAvgSamplesPerSec=23.771974245266655, CurrSamplesPerSec=23.735408144696837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:28:01,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=19, lr=[9.489741587471934e-06, 9.489741587471934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:28:01,589] [INFO] [timer.py:199:stop] epoch=1/micro_step=1240/global_step=1230, RunningAvgSamplesPerSec=23.77287900448572, CurrSamplesPerSec=23.96411570925083, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:28:28,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=19, lr=[9.487099003268963e-06, 9.487099003268963e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:28:28,465] [INFO] [timer.py:199:stop] epoch=1/micro_step=1280/global_step=1240, RunningAvgSamplesPerSec=23.773563330873657, CurrSamplesPerSec=23.85438801892992, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:28:55,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=19, lr=[9.484435183421194e-06, 9.484435183421194e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
16:28:55,359] [INFO] [timer.py:199:stop] epoch=1/micro_step=1320/global_step=1250, RunningAvgSamplesPerSec=23.774091889709556, CurrSamplesPerSec=23.845297504968965, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:29:21,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=19, lr=[9.481750140062205e-06, 9.481750140062205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:29:22,186] [INFO] [timer.py:199:stop] epoch=1/micro_step=1360/global_step=1260, RunningAvgSamplesPerSec=23.775056694348695, CurrSamplesPerSec=23.89211872494173, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:29:48,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=19, lr=[9.479043885422243e-06, 9.479043885422243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:29:49,005] [INFO] [timer.py:199:stop] epoch=1/micro_step=1400/global_step=1270, RunningAvgSamplesPerSec=23.776102535332505, CurrSamplesPerSec=23.891482912414304, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:30:15,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=19, lr=[9.476316431828172e-06, 9.476316431828172e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:30:15,838] [INFO] [timer.py:199:stop] epoch=1/micro_step=1440/global_step=1280, RunningAvgSamplesPerSec=23.777012401594774, CurrSamplesPerSec=23.880755756334594, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:30:42,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=19, lr=[9.473567791703418e-06, 9.473567791703418e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:30:42,655] [INFO] [timer.py:199:stop] epoch=1/micro_step=1480/global_step=1290, RunningAvgSamplesPerSec=23.77802937732494, CurrSamplesPerSec=23.899097874801473, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:31:09,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=19, lr=[9.470797977567908e-06, 9.470797977567908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:31:09,508] [INFO] 
[timer.py:199:stop] epoch=1/micro_step=1520/global_step=1300, RunningAvgSamplesPerSec=23.778778160724396, CurrSamplesPerSec=23.944424161371238, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:31:36,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=19, lr=[9.468007002038018e-06, 9.468007002038018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:31:36,324] [INFO] [timer.py:199:stop] epoch=1/micro_step=1560/global_step=1310, RunningAvgSamplesPerSec=23.779757727458335, CurrSamplesPerSec=23.868272363613375, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:32:02,850] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:32:02,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=20, lr=[9.46547704157403e-06, 9.46547704157403e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:32:02,852] [INFO] [timer.py:199:stop] epoch=1/micro_step=1600/global_step=1320, RunningAvgSamplesPerSec=23.78269608346081, CurrSamplesPerSec=26.792770620397285, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:32:05,239] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:32:32,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=21, lr=[9.462929960011018e-06, 9.462929960011018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:32:32,995] [INFO] [timer.py:199:stop] epoch=1/micro_step=1640/global_step=1330, RunningAvgSamplesPerSec=23.76472962173884, CurrSamplesPerSec=23.921153317815154, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:32:59,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=21, lr=[9.46007980051622e-06, 9.46007980051622e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:32:59,795] [INFO] [timer.py:199:stop] epoch=1/micro_step=1680/global_step=1340, RunningAvgSamplesPerSec=23.765916747943496, CurrSamplesPerSec=23.91313445174532, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:33:26,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=21, lr=[9.45720852844784e-06, 9.45720852844784e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:33:26,691] [INFO] [timer.py:199:stop] epoch=1/micro_step=1720/global_step=1350, RunningAvgSamplesPerSec=23.7665831192637, CurrSamplesPerSec=23.954483925353625, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:33:53,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=21, lr=[9.45431615688439e-06, 9.45431615688439e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:33:53,507] [INFO] [timer.py:199:stop] epoch=1/micro_step=1760/global_step=1360, RunningAvgSamplesPerSec=23.767677445765205, CurrSamplesPerSec=23.895043047679003, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:34:20,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=21, lr=[9.45140269900049e-06, 9.45140269900049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:34:20,364] [INFO] [timer.py:199:stop] epoch=1/micro_step=1800/global_step=1370, RunningAvgSamplesPerSec=23.76846828598544, CurrSamplesPerSec=23.838089354208975, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:34:47,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=21, lr=[9.448468168066802e-06, 9.448468168066802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:34:47,267] [INFO] [timer.py:199:stop] epoch=1/micro_step=1840/global_step=1380, RunningAvgSamplesPerSec=23.768954961513796, CurrSamplesPerSec=23.711433575813434, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:35:13,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=21, lr=[9.445512577449983e-06, 9.445512577449983e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:35:14,140] [INFO] [timer.py:199:stop] epoch=1/micro_step=1880/global_step=1390, RunningAvgSamplesPerSec=23.769616632281554, CurrSamplesPerSec=23.874065463575068, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:35:40,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=21, lr=[9.442535940612606e-06, 9.442535940612606e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:35:41,024] [INFO] [timer.py:199:stop] epoch=1/micro_step=1920/global_step=1400, RunningAvgSamplesPerSec=23.770192213727483, CurrSamplesPerSec=23.743881583567866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:36:07,681] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=21, lr=[9.439538271113117e-06, 9.439538271113117e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:36:07,923] [INFO] [timer.py:199:stop] epoch=1/micro_step=1960/global_step=1410, RunningAvgSamplesPerSec=23.770666744146197, CurrSamplesPerSec=23.804152814890493, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:36:34,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=21, lr=[9.436519582605764e-06, 9.436519582605764e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:36:34,792] [INFO] [timer.py:199:stop] epoch=1/micro_step=2000/global_step=1420, RunningAvgSamplesPerSec=23.771303182537363, CurrSamplesPerSec=23.783165869037628, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 16:36:39,869] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:36:42,266] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:37:00,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=23, lr=[9.434089507350989e-06, 9.434089507350989e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:37:01,029] [INFO] [timer.py:199:stop] epoch=1/micro_step=2040/global_step=1430, RunningAvgSamplesPerSec=23.775849204160664, CurrSamplesPerSec=23.882075142265965, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:37:27,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=23, lr=[9.431033019343736e-06, 9.431033019343736e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:37:27,830] [INFO] [timer.py:199:stop] epoch=1/micro_step=2080/global_step=1440, RunningAvgSamplesPerSec=23.776849428180558, CurrSamplesPerSec=23.90867664339085, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:37:54,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=23, lr=[9.427955551069644e-06, 9.427955551069644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:37:54,649] [INFO] [timer.py:199:stop] epoch=1/micro_step=2120/global_step=1450, RunningAvgSamplesPerSec=23.777713092762333, CurrSamplesPerSec=23.923796907375355, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:38:21,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=23, lr=[9.424857116546437e-06, 9.424857116546437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:38:21,456] [INFO] [timer.py:199:stop] epoch=1/micro_step=2160/global_step=1460, RunningAvgSamplesPerSec=23.77865421745167, CurrSamplesPerSec=23.90826992184703, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:38:48,049] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=1470, skipped=23, lr=[9.42173772988734e-06, 9.42173772988734e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:38:48,291] [INFO] [timer.py:199:stop] epoch=1/micro_step=2200/global_step=1470, RunningAvgSamplesPerSec=23.779412064207214, CurrSamplesPerSec=23.888238456425608, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:39:14,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=23, lr=[9.418597405301018e-06, 9.418597405301018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:39:15,082] [INFO] [timer.py:199:stop] epoch=1/micro_step=2240/global_step=1480, RunningAvgSamplesPerSec=23.78041639074274, CurrSamplesPerSec=23.99493419784732, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:39:41,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=23, lr=[9.415436157091501e-06, 9.415436157091501e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:39:41,912] [INFO] [timer.py:199:stop] epoch=1/micro_step=2280/global_step=1490, RunningAvgSamplesPerSec=23.781179367908983, CurrSamplesPerSec=23.84701760208689, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:40:08,493] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=23, lr=[9.412253999658128e-06, 9.412253999658128e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:40:08,734] [INFO] [timer.py:199:stop] epoch=1/micro_step=2320/global_step=1500, RunningAvgSamplesPerSec=23.781988534463125, CurrSamplesPerSec=23.878251236296837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:40:35,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=23, lr=[9.40905094749548e-06, 9.40905094749548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:40:35,527] [INFO] [timer.py:199:stop] epoch=1/micro_step=2360/global_step=1510, RunningAvgSamplesPerSec=23.782962756689493, CurrSamplesPerSec=23.866343369624825, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:41:02,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=23, 
lr=[9.40582701519331e-06, 9.40582701519331e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:41:02,337] [INFO] [timer.py:199:stop] epoch=1/micro_step=2400/global_step=1520, RunningAvgSamplesPerSec=23.78382600467676, CurrSamplesPerSec=23.9046526255178, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:41:12,776] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:41:15,171] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:41:28,342] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=25, lr=[9.403232845516154e-06, 9.403232845516154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:41:28,585] [INFO] [timer.py:199:stop] epoch=1/micro_step=2440/global_step=1530, RunningAvgSamplesPerSec=23.787905914759275, CurrSamplesPerSec=23.866419759496672, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:41:55,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=25, lr=[9.399971366032568e-06, 9.399971366032568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:41:55,460] [INFO] [timer.py:199:stop] epoch=1/micro_step=2480/global_step=1540, RunningAvgSamplesPerSec=23.788339061882755, CurrSamplesPerSec=23.822434485643154, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:42:22,101] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=25, lr=[9.396689047766535e-06, 9.396689047766535e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:42:22,343] [INFO] [timer.py:199:stop] epoch=1/micro_step=2520/global_step=1550, RunningAvgSamplesPerSec=23.788717228190258, CurrSamplesPerSec=23.874929680987155, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:42:48,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=25, lr=[9.393385905668858e-06, 9.393385905668858e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 16:42:49,219] [INFO] [timer.py:199:stop] epoch=1/micro_step=2560/global_step=1560, RunningAvgSamplesPerSec=23.789161894517697, CurrSamplesPerSec=23.791142072143934, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:43:15,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=25, lr=[9.390061954785201e-06, 9.390061954785201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:43:16,108] [INFO] [timer.py:199:stop] epoch=1/micro_step=2600/global_step=1570, RunningAvgSamplesPerSec=23.78950910417737, CurrSamplesPerSec=23.799490772354602, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:43:42,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=25, lr=[9.386717210256001e-06, 9.386717210256001e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:43:43,002] [INFO] [timer.py:199:stop] epoch=1/micro_step=2640/global_step=1580, RunningAvgSamplesPerSec=23.78982117702505, CurrSamplesPerSec=23.939726233127686, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:44:09,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=25, lr=[9.38335168731642e-06, 9.38335168731642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:44:09,849] [INFO] [timer.py:199:stop] epoch=1/micro_step=2680/global_step=1590, RunningAvgSamplesPerSec=23.79038335036457, CurrSamplesPerSec=23.84435918221453, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:44:36,446] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=25, lr=[9.379965401296254e-06, 9.379965401296254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:44:36,689] [INFO] [timer.py:199:stop] epoch=1/micro_step=2720/global_step=1600, RunningAvgSamplesPerSec=23.790988138663298, CurrSamplesPerSec=23.890953448876463, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:45:03,277] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=25, lr=[9.376558367619881e-06, 9.376558367619881e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
16:45:03,521] [INFO] [timer.py:199:stop] epoch=1/micro_step=2760/global_step=1610, RunningAvgSamplesPerSec=23.79161489062927, CurrSamplesPerSec=23.886905635871045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:45:31,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=25, lr=[9.373130601806183e-06, 9.373130601806183e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:45:31,446] [INFO] [timer.py:199:stop] epoch=1/micro_step=2800/global_step=1620, RunningAvgSamplesPerSec=23.787023685221165, CurrSamplesPerSec=23.87165787575287, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:45:47,239] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:45:49,633] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:45:57,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=27, lr=[9.370373472505324e-06, 9.370373472505324e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:45:57,678] [INFO] [timer.py:199:stop] epoch=1/micro_step=2840/global_step=1630, RunningAvgSamplesPerSec=23.790951145888506, CurrSamplesPerSec=23.869942712590497, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:46:25,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=27, lr=[9.366908428253404e-06, 9.366908428253404e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:46:25,270] [INFO] [timer.py:199:stop] epoch=1/micro_step=2880/global_step=1640, RunningAvgSamplesPerSec=23.787944856331315, CurrSamplesPerSec=19.062965663640192, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:46:51,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=27, lr=[9.363422695819182e-06, 9.363422695819182e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:46:52,098] [INFO] [timer.py:199:stop] 
epoch=1/micro_step=2920/global_step=1650, RunningAvgSamplesPerSec=23.788612128601848, CurrSamplesPerSec=23.95515088636183, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:47:18,683] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=27, lr=[9.359916291080011e-06, 9.359916291080011e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:47:18,924] [INFO] [timer.py:199:stop] epoch=1/micro_step=2960/global_step=1660, RunningAvgSamplesPerSec=23.78930250624124, CurrSamplesPerSec=23.88195828255529, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:47:45,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=27, lr=[9.3563892300074e-06, 9.3563892300074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:47:45,787] [INFO] [timer.py:199:stop] epoch=1/micro_step=3000/global_step=1670, RunningAvgSamplesPerSec=23.789814660472477, CurrSamplesPerSec=23.865085129046026, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:48:12,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=27, lr=[9.352841528666947e-06, 9.352841528666947e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:48:12,620] [INFO] [timer.py:199:stop] epoch=1/micro_step=3040/global_step=1680, RunningAvgSamplesPerSec=23.790426487340785, CurrSamplesPerSec=23.878763143816386, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:48:39,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=27, lr=[9.349273203218271e-06, 9.349273203218271e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:48:39,476] [INFO] [timer.py:199:stop] epoch=1/micro_step=3080/global_step=1690, RunningAvgSamplesPerSec=23.790911338189126, CurrSamplesPerSec=23.869547920324653, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:49:06,101] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=27, lr=[9.345684269914927e-06, 9.345684269914927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:49:06,342] [INFO] [timer.py:199:stop] 
epoch=1/micro_step=3120/global_step=1700, RunningAvgSamplesPerSec=23.7913609247077, CurrSamplesPerSec=23.895813061146953, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:49:33,007] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=27, lr=[9.342074745104338e-06, 9.342074745104338e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:49:33,248] [INFO] [timer.py:199:stop] epoch=1/micro_step=3160/global_step=1710, RunningAvgSamplesPerSec=23.791613591759543, CurrSamplesPerSec=23.819086168892817, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:49:59,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=27, lr=[9.338444645227724e-06, 9.338444645227724e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:50:00,140] [INFO] [timer.py:199:stop] epoch=1/micro_step=3200/global_step=1720, RunningAvgSamplesPerSec=23.791913496107163, CurrSamplesPerSec=23.813581671570695, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:50:21,344] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 16:50:23,768] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 16:50:26,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=29, lr=[9.335525762387226e-06, 9.335525762387226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:50:26,468] [INFO] [timer.py:199:stop] epoch=1/micro_step=3240/global_step=1730, RunningAvgSamplesPerSec=23.795175821185406, CurrSamplesPerSec=23.793837146492354, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:50:53,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=29, lr=[9.33185866912274e-06, 9.33185866912274e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:50:53,341] [INFO] [timer.py:199:stop] epoch=1/micro_step=3280/global_step=1740, RunningAvgSamplesPerSec=23.795622555485064, CurrSamplesPerSec=23.9154481407126, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:51:19,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=29, lr=[9.32817104732598e-06, 9.32817104732598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:51:20,184] [INFO] [timer.py:199:stop] epoch=1/micro_step=3320/global_step=1750, RunningAvgSamplesPerSec=23.796190668774045, CurrSamplesPerSec=23.85114303322299, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:51:46,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=29, lr=[9.324462913793895e-06, 9.324462913793895e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:51:47,048] [INFO] [timer.py:199:stop] epoch=1/micro_step=3360/global_step=1760, RunningAvgSamplesPerSec=23.796629513091126, CurrSamplesPerSec=23.902015393215468, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:52:13,638] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=29, lr=[9.320734285416857e-06, 9.320734285416857e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:52:13,880] [INFO] [timer.py:199:stop] epoch=1/micro_step=3400/global_step=1770, RunningAvgSamplesPerSec=23.797236810638623, CurrSamplesPerSec=23.82109842821927, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:52:40,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=29, lr=[9.316985179178602e-06, 9.316985179178602e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:52:40,689] [INFO] [timer.py:199:stop] epoch=1/micro_step=3440/global_step=1780, RunningAvgSamplesPerSec=23.797888427326264, CurrSamplesPerSec=23.908872556088248, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:53:07,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=29, lr=[9.31321561215613e-06, 9.31321561215613e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:53:07,528] [INFO] [timer.py:199:stop] epoch=1/micro_step=3480/global_step=1790, RunningAvgSamplesPerSec=23.79838888302286, CurrSamplesPerSec=23.871415869975948, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:53:34,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=29, lr=[9.309425601519644e-06, 9.309425601519644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:53:34,387] [INFO] [timer.py:199:stop] epoch=1/micro_step=3520/global_step=1800, RunningAvgSamplesPerSec=23.798837940482652, CurrSamplesPerSec=23.84375555988429, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:54:00,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=29, lr=[9.30561516453247e-06, 9.30561516453247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:54:01,218] [INFO] [timer.py:199:stop] epoch=1/micro_step=3560/global_step=1810, RunningAvgSamplesPerSec=23.799373244285437, CurrSamplesPerSec=23.97483867573839, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:54:27,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=29, lr=[9.30178431855097e-06, 9.30178431855097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:54:28,025] [INFO] [timer.py:199:stop] epoch=1/micro_step=3600/global_step=1820, RunningAvgSamplesPerSec=23.800064620066756, CurrSamplesPerSec=23.898070209737515, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 16:54:54,500] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-23 16:54:54,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=30, lr=[9.298319121897418e-06, 9.298319121897418e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 16:54:54,501] [INFO] [timer.py:199:stop] epoch=1/micro_step=3640/global_step=1830, RunningAvgSamplesPerSec=23.80228573258198, CurrSamplesPerSec=26.845089192628578, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 16:54:56,895] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072
[2023-04-23 16:55:20,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=31, lr=[9.294837420876113e-06, 9.294837420876113e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 16:55:21,011] [INFO] [timer.py:199:stop] epoch=1/micro_step=3680/global_step=1840, RunningAvgSamplesPerSec=23.804351275797725, CurrSamplesPerSec=23.95676927627092, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 2/16 *****
ppl: 1.9489798545837402
saving the final model ...
Beginning of Epoch 3/16, Total Micro Batches 3680 [2023-04-23 16:56:38,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=31, lr=[9.290949522837343e-06, 9.290949522837343e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:56:38,994] [INFO] [timer.py:199:stop] epoch=2/micro_step=40/global_step=1850, RunningAvgSamplesPerSec=23.80450909343205, CurrSamplesPerSec=23.92220002413292, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:57:05,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=31, lr=[9.287041282605565e-06, 9.287041282605565e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:57:05,845] [INFO] [timer.py:199:stop] epoch=2/micro_step=80/global_step=1860, RunningAvgSamplesPerSec=23.8049080148209, CurrSamplesPerSec=23.908951347969666, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:57:32,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=31, lr=[9.283112717982631e-06, 9.283112717982631e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:57:32,736] [INFO] [timer.py:199:stop] epoch=2/micro_step=120/global_step=1870, RunningAvgSamplesPerSec=23.805131831193282, CurrSamplesPerSec=23.86194328967933, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:57:59,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=31, lr=[9.279163846862974e-06, 9.279163846862974e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:57:59,603] [INFO] [timer.py:199:stop] epoch=2/micro_step=160/global_step=1880, RunningAvgSamplesPerSec=23.805444017809965, CurrSamplesPerSec=23.90452702983094, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:58:26,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=31, lr=[9.275194687233515e-06, 9.275194687233515e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:58:26,456] [INFO] [timer.py:199:stop] epoch=2/micro_step=200/global_step=1890, RunningAvgSamplesPerSec=23.805824172285654, CurrSamplesPerSec=23.83332513542093, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 16:58:53,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=31, lr=[9.271205257173593e-06, 9.271205257173593e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:58:53,306] [INFO] [timer.py:199:stop] epoch=2/micro_step=240/global_step=1900, RunningAvgSamplesPerSec=23.806226908213063, CurrSamplesPerSec=23.925057081888646, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:59:19,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=31, lr=[9.267195574854878e-06, 9.267195574854878e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:59:20,138] [INFO] [timer.py:199:stop] epoch=2/micro_step=280/global_step=1910, RunningAvgSamplesPerSec=23.806692999811833, CurrSamplesPerSec=23.93757220897487, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 16:59:46,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=31, lr=[9.263165658541286e-06, 9.263165658541286e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 16:59:47,010] [INFO] [timer.py:199:stop] epoch=2/micro_step=320/global_step=1920, RunningAvgSamplesPerSec=23.806965570122234, CurrSamplesPerSec=23.92312529498914, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:00:13,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=31, lr=[9.259115526588901e-06, 9.259115526588901e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:00:13,921] [INFO] [timer.py:199:stop] epoch=2/micro_step=360/global_step=1930, RunningAvgSamplesPerSec=23.807067728267555, CurrSamplesPerSec=23.77073796001571, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:00:19,004] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:00:21,401] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:00:39,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=33, lr=[9.255860878161135e-06, 9.255860878161135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:00:40,188] [INFO] [timer.py:199:stop] epoch=2/micro_step=400/global_step=1940, RunningAvgSamplesPerSec=23.81009615068174, CurrSamplesPerSec=23.878055825106916, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:01:06,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=33, lr=[9.251774404610116e-06, 9.251774404610116e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:01:07,055] [INFO] [timer.py:199:stop] epoch=2/micro_step=440/global_step=1950, RunningAvgSamplesPerSec=23.810369386514715, CurrSamplesPerSec=23.886489027102957, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:01:34,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=33, lr=[9.247667767306936e-06, 9.247667767306936e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:01:34,351] [INFO] [timer.py:199:stop] epoch=2/micro_step=480/global_step=1960, RunningAvgSamplesPerSec=23.808753951978108, CurrSamplesPerSec=23.79512374312036, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:02:00,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=33, lr=[9.243540984957136e-06, 9.243540984957136e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:02:01,165] [INFO] [timer.py:199:stop] epoch=2/micro_step=520/global_step=1970, RunningAvgSamplesPerSec=23.809289398525117, CurrSamplesPerSec=23.943627519189835, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:02:28,043] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=33, lr=[9.239394076358021e-06, 9.239394076358021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:02:28,285] [INFO] [timer.py:199:stop] epoch=2/micro_step=560/global_step=1980, RunningAvgSamplesPerSec=23.808831553836313, CurrSamplesPerSec=23.779200837370865, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 17:02:54,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=33, lr=[9.235227060398567e-06, 9.235227060398567e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:02:55,183] [INFO] [timer.py:199:stop] epoch=2/micro_step=600/global_step=1990, RunningAvgSamplesPerSec=23.80900240209914, CurrSamplesPerSec=23.75767119194654, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:03:21,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=33, lr=[9.231039956059337e-06, 9.231039956059337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:03:22,076] [INFO] [timer.py:199:stop] epoch=2/micro_step=640/global_step=2000, RunningAvgSamplesPerSec=23.809165358389198, CurrSamplesPerSec=23.763414928871878, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:03:48,836] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=33, lr=[9.226832782412397e-06, 9.226832782412397e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:03:49,079] [INFO] [timer.py:199:stop] epoch=2/micro_step=680/global_step=2010, RunningAvgSamplesPerSec=23.808890916400355, CurrSamplesPerSec=23.732000320392732, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:04:15,751] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=33, lr=[9.222605558621231e-06, 9.222605558621231e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:04:15,994] [INFO] [timer.py:199:stop] epoch=2/micro_step=720/global_step=2020, RunningAvgSamplesPerSec=23.808958445369647, CurrSamplesPerSec=23.875267316087776, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:04:42,620] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=33, lr=[9.218358303940643e-06, 9.218358303940643e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:04:42,863] [INFO] [timer.py:199:stop] epoch=2/micro_step=760/global_step=2030, RunningAvgSamplesPerSec=23.809233588051615, CurrSamplesPerSec=23.80365676610268, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 17:04:53,318] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:04:55,719] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:05:08,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=35, lr=[9.214946090953271e-06, 9.214946090953271e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:05:09,167] [INFO] [timer.py:199:stop] epoch=2/micro_step=800/global_step=2040, RunningAvgSamplesPerSec=23.811967934542263, CurrSamplesPerSec=23.835092182296613, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:05:35,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=35, lr=[9.210662829485026e-06, 9.210662829485026e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:05:36,033] [INFO] [timer.py:199:stop] epoch=2/micro_step=840/global_step=2050, RunningAvgSamplesPerSec=23.81223826077831, CurrSamplesPerSec=23.87732306162916, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:06:02,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=35, lr=[9.206359591525938e-06, 9.206359591525938e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:06:02,890] [INFO] [timer.py:199:stop] epoch=2/micro_step=880/global_step=2060, RunningAvgSamplesPerSec=23.812525132335317, CurrSamplesPerSec=23.864423173774863, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:06:29,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=35, lr=[9.202036396677058e-06, 9.202036396677058e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:06:29,834] [INFO] [timer.py:199:stop] epoch=2/micro_step=920/global_step=2070, RunningAvgSamplesPerSec=23.81247322138704, CurrSamplesPerSec=23.799906462053517, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:06:56,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, 
skipped=35, lr=[9.197693264630336e-06, 9.197693264630336e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:06:56,795] [INFO] [timer.py:199:stop] epoch=2/micro_step=960/global_step=2080, RunningAvgSamplesPerSec=23.812369888915605, CurrSamplesPerSec=23.695480207486376, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:07:23,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=35, lr=[9.193330215168538e-06, 9.193330215168538e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:07:23,683] [INFO] [timer.py:199:stop] epoch=2/micro_step=1000/global_step=2090, RunningAvgSamplesPerSec=23.812586927891854, CurrSamplesPerSec=23.849064247769356, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:07:50,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=35, lr=[9.188947268165152e-06, 9.188947268165152e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:07:50,609] [INFO] [timer.py:199:stop] epoch=2/micro_step=1040/global_step=2100, RunningAvgSamplesPerSec=23.81263095571881, CurrSamplesPerSec=23.79588522076234, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:08:17,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=35, lr=[9.1845444435843e-06, 9.1845444435843e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:08:17,502] [INFO] [timer.py:199:stop] epoch=2/micro_step=1080/global_step=2110, RunningAvgSamplesPerSec=23.812798732352174, CurrSamplesPerSec=23.87616559936336, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:08:44,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=35, lr=[9.18012176148064e-06, 9.18012176148064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:08:44,371] [INFO] [timer.py:199:stop] epoch=2/micro_step=1120/global_step=2120, RunningAvgSamplesPerSec=23.813061093894664, CurrSamplesPerSec=23.832309466587237, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:09:11,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=35, 
lr=[9.17567924199929e-06, 9.17567924199929e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:09:11,325] [INFO] [timer.py:199:stop] epoch=2/micro_step=1160/global_step=2130, RunningAvgSamplesPerSec=23.81296664189268, CurrSamplesPerSec=23.74146448329468, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:09:27,253] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:09:29,670] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:09:37,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=37, lr=[9.17211095709747e-06, 9.17211095709747e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:09:37,745] [INFO] [timer.py:199:stop] epoch=2/micro_step=1200/global_step=2140, RunningAvgSamplesPerSec=23.815078210805073, CurrSamplesPerSec=23.880507192696733, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:10:04,463] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=37, lr=[9.167632781390316e-06, 9.167632781390316e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:10:04,706] [INFO] [timer.py:199:stop] epoch=2/micro_step=1240/global_step=2150, RunningAvgSamplesPerSec=23.81495753868605, CurrSamplesPerSec=23.793124305313036, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:10:31,550] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=37, lr=[9.163134825192193e-06, 9.163134825192193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:10:31,794] [INFO] [timer.py:199:stop] epoch=2/micro_step=1280/global_step=2160, RunningAvgSamplesPerSec=23.814323298074804, CurrSamplesPerSec=23.73925571344949, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:10:58,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=37, lr=[9.158617108991084e-06, 9.158617108991084e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 17:10:58,823] [INFO] [timer.py:199:stop] epoch=2/micro_step=1320/global_step=2170, RunningAvgSamplesPerSec=23.813908836767286, CurrSamplesPerSec=23.80830143643367, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:11:25,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=37, lr=[9.154079653364976e-06, 9.154079653364976e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:11:25,780] [INFO] [timer.py:199:stop] epoch=2/micro_step=1360/global_step=2180, RunningAvgSamplesPerSec=23.813794240125418, CurrSamplesPerSec=23.69215075220145, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:11:52,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=37, lr=[9.14952247898177e-06, 9.14952247898177e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:11:52,726] [INFO] [timer.py:199:stop] epoch=2/micro_step=1400/global_step=2190, RunningAvgSamplesPerSec=23.813718680918445, CurrSamplesPerSec=23.813655611486993, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:12:19,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=37, lr=[9.144945606599182e-06, 9.144945606599182e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:12:19,678] [INFO] [timer.py:199:stop] epoch=2/micro_step=1440/global_step=2200, RunningAvgSamplesPerSec=23.813630875732184, CurrSamplesPerSec=23.822354148841825, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:12:46,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=37, lr=[9.140349057064654e-06, 9.140349057064654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:12:46,611] [INFO] [timer.py:199:stop] epoch=2/micro_step=1480/global_step=2210, RunningAvgSamplesPerSec=23.813634825406933, CurrSamplesPerSec=23.746914680041975, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:13:13,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=37, lr=[9.13573285131526e-06, 9.13573285131526e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
17:13:13,719] [INFO] [timer.py:199:stop] epoch=2/micro_step=1520/global_step=2220, RunningAvgSamplesPerSec=23.812985868614465, CurrSamplesPerSec=23.682930620825488, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:13:40,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=37, lr=[9.131097010377597e-06, 9.131097010377597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:13:40,798] [INFO] [timer.py:199:stop] epoch=2/micro_step=1560/global_step=2230, RunningAvgSamplesPerSec=23.81243784220234, CurrSamplesPerSec=23.70999894538005, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:14:02,160] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:14:04,575] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:14:07,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=39, lr=[9.127374214478892e-06, 9.127374214478892e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:14:07,292] [INFO] [timer.py:199:stop] epoch=2/micro_step=1600/global_step=2240, RunningAvgSamplesPerSec=23.814226915474812, CurrSamplesPerSec=23.615701874780996, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:14:34,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=39, lr=[9.122703083474814e-06, 9.122703083474814e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:14:34,329] [INFO] [timer.py:199:stop] epoch=2/micro_step=1640/global_step=2250, RunningAvgSamplesPerSec=23.81384958848024, CurrSamplesPerSec=23.69283246145988, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:15:01,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=39, lr=[9.118012376632459e-06, 9.118012376632459e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:15:01,400] [INFO] [timer.py:199:stop] 
epoch=2/micro_step=1680/global_step=2260, RunningAvgSamplesPerSec=23.813324171887995, CurrSamplesPerSec=23.68704127310031, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:15:28,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=39, lr=[9.113302115317776e-06, 9.113302115317776e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:15:28,473] [INFO] [timer.py:199:stop] epoch=2/micro_step=1720/global_step=2270, RunningAvgSamplesPerSec=23.81278237201018, CurrSamplesPerSec=23.716367097127254, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:15:55,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=39, lr=[9.108572320985791e-06, 9.108572320985791e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:15:55,537] [INFO] [timer.py:199:stop] epoch=2/micro_step=1760/global_step=2280, RunningAvgSamplesPerSec=23.81226869272185, CurrSamplesPerSec=23.65951231352857, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:16:22,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=39, lr=[9.103823015180496e-06, 9.103823015180496e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:16:22,617] [INFO] [timer.py:199:stop] epoch=2/micro_step=1800/global_step=2290, RunningAvgSamplesPerSec=23.81169910066543, CurrSamplesPerSec=23.69257733867764, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:16:51,020] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=39, lr=[9.09905421953476e-06, 9.09905421953476e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:16:51,263] [INFO] [timer.py:199:stop] epoch=2/micro_step=1840/global_step=2300, RunningAvgSamplesPerSec=23.80628811275415, CurrSamplesPerSec=23.67839530375836, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:17:18,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=39, lr=[9.094265955770222e-06, 9.094265955770222e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:17:18,276] [INFO] [timer.py:199:stop] 
epoch=2/micro_step=1880/global_step=2310, RunningAvgSamplesPerSec=23.805975841449726, CurrSamplesPerSec=23.787800437499474, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:17:45,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=39, lr=[9.089458245697207e-06, 9.089458245697207e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:17:45,399] [INFO] [timer.py:199:stop] epoch=2/micro_step=1920/global_step=2320, RunningAvgSamplesPerSec=23.805273439645042, CurrSamplesPerSec=23.658129830371404, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:18:12,263] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=39, lr=[9.084631111214609e-06, 9.084631111214609e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:18:12,506] [INFO] [timer.py:199:stop] epoch=2/micro_step=1960/global_step=2330, RunningAvgSamplesPerSec=23.80463898960195, CurrSamplesPerSec=23.53324622623579, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:18:39,354] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:18:39,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=40, lr=[9.080270100480813e-06, 9.080270100480813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:18:39,356] [INFO] [timer.py:199:stop] epoch=2/micro_step=2000/global_step=2340, RunningAvgSamplesPerSec=23.804990034282252, CurrSamplesPerSec=26.37865867104497, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:18:41,764] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:19:05,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=41, lr=[9.0758933898739e-06, 9.0758933898739e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:19:06,180] [INFO] [timer.py:199:stop] epoch=2/micro_step=2040/global_step=2350, RunningAvgSamplesPerSec=23.805431014766206, CurrSamplesPerSec=23.652912047549115, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:19:33,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=41, lr=[9.071011984299334e-06, 9.071011984299334e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:19:33,262] [INFO] [timer.py:199:stop] epoch=2/micro_step=2080/global_step=2360, RunningAvgSamplesPerSec=23.804886992147754, CurrSamplesPerSec=23.659996116536004, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:20:00,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=41, lr=[9.06611123833705e-06, 9.06611123833705e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:20:00,335] [INFO] [timer.py:199:stop] epoch=2/micro_step=2120/global_step=2370, RunningAvgSamplesPerSec=23.804395515325965, CurrSamplesPerSec=23.740118596475714, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:20:27,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=41, lr=[9.061191174309717e-06, 9.061191174309717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:20:27,449] [INFO] [timer.py:199:stop] epoch=2/micro_step=2160/global_step=2380, RunningAvgSamplesPerSec=23.803752973967008, CurrSamplesPerSec=23.633035535376024, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:20:54,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=41, lr=[9.056251814628e-06, 9.056251814628e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:20:54,580] [INFO] [timer.py:199:stop] epoch=2/micro_step=2200/global_step=2390, RunningAvgSamplesPerSec=23.80304201268888, CurrSamplesPerSec=23.606173694042663, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 17:21:21,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=41, lr=[9.051293181790457e-06, 9.051293181790457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:21:21,650] [INFO] [timer.py:199:stop] epoch=2/micro_step=2240/global_step=2400, RunningAvgSamplesPerSec=23.802580487998824, CurrSamplesPerSec=23.73990024545088, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:21:50,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=41, lr=[9.046315298383423e-06, 9.046315298383423e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:21:50,615] [INFO] [timer.py:199:stop] epoch=2/micro_step=2280/global_step=2410, RunningAvgSamplesPerSec=23.79891119734023, CurrSamplesPerSec=23.612234870242222, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:22:17,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=41, lr=[9.041318187080935e-06, 9.041318187080935e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:22:17,681] [INFO] [timer.py:199:stop] epoch=2/micro_step=2320/global_step=2420, RunningAvgSamplesPerSec=23.79846787027696, CurrSamplesPerSec=23.69853442512292, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:22:44,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=41, lr=[9.036301870644597e-06, 9.036301870644597e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:22:44,878] [INFO] [timer.py:199:stop] epoch=2/micro_step=2360/global_step=2430, RunningAvgSamplesPerSec=23.797739840776757, CurrSamplesPerSec=23.513735016463134, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:23:11,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=41, lr=[9.031266371923498e-06, 9.031266371923498e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:23:12,006] [INFO] [timer.py:199:stop] epoch=2/micro_step=2400/global_step=2440, RunningAvgSamplesPerSec=23.79707919120614, CurrSamplesPerSec=23.64342254463692, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 17:23:17,129] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:23:19,542] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:23:38,285] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=43, lr=[9.027224177111965e-06, 9.027224177111965e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:23:38,526] [INFO] [timer.py:199:stop] epoch=2/micro_step=2440/global_step=2450, RunningAvgSamplesPerSec=23.798654123806465, CurrSamplesPerSec=23.726875737886594, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:24:05,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=43, lr=[9.022154208136836e-06, 9.022154208136836e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:24:05,620] [INFO] [timer.py:199:stop] epoch=2/micro_step=2480/global_step=2460, RunningAvgSamplesPerSec=23.798196114739454, CurrSamplesPerSec=23.67475953096301, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:24:32,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=43, lr=[9.017065121318897e-06, 9.017065121318897e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:24:32,739] [INFO] [timer.py:199:stop] epoch=2/micro_step=2520/global_step=2470, RunningAvgSamplesPerSec=23.797603432938786, CurrSamplesPerSec=23.67316022813538, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:24:59,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=43, lr=[9.011956939838697e-06, 9.011956939838697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:24:59,842] [INFO] [timer.py:199:stop] epoch=2/micro_step=2560/global_step=2480, RunningAvgSamplesPerSec=23.79706824072393, CurrSamplesPerSec=23.707823246018403, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:25:26,759] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=2490, skipped=43, lr=[9.00682968696377e-06, 9.00682968696377e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:25:27,001] [INFO] [timer.py:199:stop] epoch=2/micro_step=2600/global_step=2490, RunningAvgSamplesPerSec=23.796342097844185, CurrSamplesPerSec=23.57751511585492, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:25:53,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=43, lr=[9.001683386048514e-06, 9.001683386048514e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:25:54,129] [INFO] [timer.py:199:stop] epoch=2/micro_step=2640/global_step=2500, RunningAvgSamplesPerSec=23.795724436969, CurrSamplesPerSec=23.58554253113601, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:26:21,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=43, lr=[8.99651806053409e-06, 8.99651806053409e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:26:21,264] [INFO] [timer.py:199:stop] epoch=2/micro_step=2680/global_step=2510, RunningAvgSamplesPerSec=23.795102932834226, CurrSamplesPerSec=23.692552244930273, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:26:48,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=43, lr=[8.99133373394832e-06, 8.99133373394832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:26:48,366] [INFO] [timer.py:199:stop] epoch=2/micro_step=2720/global_step=2520, RunningAvgSamplesPerSec=23.794600832754828, CurrSamplesPerSec=23.6776601235045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:27:15,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=43, lr=[8.986130429905564e-06, 8.986130429905564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:27:15,452] [INFO] [timer.py:199:stop] epoch=2/micro_step=2760/global_step=2530, RunningAvgSamplesPerSec=23.79413863587575, CurrSamplesPerSec=23.6620149564025, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:27:42,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=43, 
lr=[8.980908172106638e-06, 8.980908172106638e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:27:42,559] [INFO] [timer.py:199:stop] epoch=2/micro_step=2800/global_step=2540, RunningAvgSamplesPerSec=23.79363154476087, CurrSamplesPerSec=23.658974314454102, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:27:53,106] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:27:55,531] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:28:08,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=45, lr=[8.976716735145111e-06, 8.976716735145111e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:28:09,121] [INFO] [timer.py:199:stop] epoch=2/micro_step=2840/global_step=2550, RunningAvgSamplesPerSec=23.794992115283858, CurrSamplesPerSec=23.614507315483685, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:28:41,483] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=45, lr=[8.97146042058662e-06, 8.97146042058662e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:28:41,726] [INFO] [timer.py:199:stop] epoch=2/micro_step=2880/global_step=2560, RunningAvgSamplesPerSec=23.777959937045527, CurrSamplesPerSec=23.64648626945171, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:29:08,573] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=45, lr=[8.966185219093166e-06, 8.966185219093166e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:29:08,815] [INFO] [timer.py:199:stop] epoch=2/micro_step=2920/global_step=2570, RunningAvgSamplesPerSec=23.777576193082837, CurrSamplesPerSec=23.747992414031724, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:29:36,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=45, lr=[8.960891154693049e-06, 8.960891154693049e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 17:29:36,536] [INFO] [timer.py:199:stop] epoch=2/micro_step=2960/global_step=2580, RunningAvgSamplesPerSec=23.775855090793456, CurrSamplesPerSec=23.730777183732137, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:30:03,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=45, lr=[8.955578251500488e-06, 8.955578251500488e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:30:03,624] [INFO] [timer.py:199:stop] epoch=2/micro_step=3000/global_step=2590, RunningAvgSamplesPerSec=23.77547419442855, CurrSamplesPerSec=23.701468051965758, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:30:30,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=45, lr=[8.950246533715508e-06, 8.950246533715508e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:30:30,732] [INFO] [timer.py:199:stop] epoch=2/micro_step=3040/global_step=2600, RunningAvgSamplesPerSec=23.775071039967656, CurrSamplesPerSec=23.676584586921102, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:30:57,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=45, lr=[8.944896025623841e-06, 8.944896025623841e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:30:57,874] [INFO] [timer.py:199:stop] epoch=2/micro_step=3080/global_step=2610, RunningAvgSamplesPerSec=23.774515899999574, CurrSamplesPerSec=23.63008970149385, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:31:24,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=45, lr=[8.939526751596799e-06, 8.939526751596799e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:31:25,032] [INFO] [timer.py:199:stop] epoch=2/micro_step=3120/global_step=2620, RunningAvgSamplesPerSec=23.77392048222759, CurrSamplesPerSec=23.515005918271093, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:31:51,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=45, lr=[8.93413873609118e-06, 8.93413873609118e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
17:31:52,080] [INFO] [timer.py:199:stop] epoch=2/micro_step=3160/global_step=2630, RunningAvgSamplesPerSec=23.773694163724443, CurrSamplesPerSec=23.729031862271693, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:32:18,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=45, lr=[8.928732003649139e-06, 8.928732003649139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:32:19,199] [INFO] [timer.py:199:stop] epoch=2/micro_step=3200/global_step=2640, RunningAvgSamplesPerSec=23.77321107964837, CurrSamplesPerSec=23.68203428271573, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:32:35,195] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:32:37,616] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:32:45,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=47, lr=[8.924393158048048e-06, 8.924393158048048e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:32:45,772] [INFO] [timer.py:199:stop] epoch=2/micro_step=3240/global_step=2650, RunningAvgSamplesPerSec=23.774561968431122, CurrSamplesPerSec=23.66175006831384, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:33:12,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=47, lr=[8.918952797238763e-06, 8.918952797238763e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:33:12,865] [INFO] [timer.py:199:stop] epoch=2/micro_step=3280/global_step=2660, RunningAvgSamplesPerSec=23.774179084138492, CurrSamplesPerSec=23.648342385587622, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:33:39,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=47, lr=[8.913493788664306e-06, 8.913493788664306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:33:39,992] [INFO] [timer.py:199:stop] 
epoch=2/micro_step=3320/global_step=2670, RunningAvgSamplesPerSec=23.7736820320858, CurrSamplesPerSec=23.57080528450173, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:34:06,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=47, lr=[8.90801615719021e-06, 8.90801615719021e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:34:07,087] [INFO] [timer.py:199:stop] epoch=2/micro_step=3360/global_step=2680, RunningAvgSamplesPerSec=23.773268745753914, CurrSamplesPerSec=23.70867965749546, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:34:34,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=47, lr=[8.90251992776683e-06, 8.90251992776683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:34:35,158] [INFO] [timer.py:199:stop] epoch=2/micro_step=3400/global_step=2690, RunningAvgSamplesPerSec=23.770037396018086, CurrSamplesPerSec=23.851272307079224, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:35:01,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=47, lr=[8.897005125429239e-06, 8.897005125429239e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:35:02,098] [INFO] [timer.py:199:stop] epoch=2/micro_step=3440/global_step=2700, RunningAvgSamplesPerSec=23.77018429929141, CurrSamplesPerSec=23.79870163490734, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:35:32,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=47, lr=[8.891471775297104e-06, 8.891471775297104e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:35:33,017] [INFO] [timer.py:199:stop] epoch=2/micro_step=3480/global_step=2710, RunningAvgSamplesPerSec=23.757673956145155, CurrSamplesPerSec=23.936063128645497, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:35:59,693] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=47, lr=[8.88591990257458e-06, 8.88591990257458e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:35:59,935] [INFO] [timer.py:199:stop] 
epoch=2/micro_step=3520/global_step=2720, RunningAvgSamplesPerSec=23.757970717963392, CurrSamplesPerSec=23.826394915955593, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:36:27,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=47, lr=[8.880349532550186e-06, 8.880349532550186e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:36:27,500] [INFO] [timer.py:199:stop] epoch=2/micro_step=3560/global_step=2730, RunningAvgSamplesPerSec=23.756364337842875, CurrSamplesPerSec=23.795731233817673, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:36:54,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=47, lr=[8.874760690596703e-06, 8.874760690596703e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:36:54,403] [INFO] [timer.py:199:stop] epoch=2/micro_step=3600/global_step=2740, RunningAvgSamplesPerSec=23.756654942313766, CurrSamplesPerSec=23.896319340135644, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:37:15,619] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:37:18,027] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072
[2023-04-23 17:37:20,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=49, lr=[8.870276334349195e-06, 8.870276334349195e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 17:37:20,729] [INFO] [timer.py:199:stop] epoch=2/micro_step=3640/global_step=2750, RunningAvgSamplesPerSec=23.758815172788623, CurrSamplesPerSec=23.783005725241566, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 17:37:47,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=49, lr=[8.864654307131244e-06, 8.864654307131244e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 17:37:47,920] [INFO] [timer.py:199:stop] epoch=2/micro_step=3680/global_step=2760, RunningAvgSamplesPerSec=23.758194410620348, CurrSamplesPerSec=23.747586939278435, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 3/16 *****
ppl: 1.9098479747772217
saving the final model ...
Beginning of Epoch 4/16, Total Micro Batches 3680
[2023-04-23 17:39:05,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=49, lr=[8.859013879475229e-06, 8.859013879475229e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 17:39:06,194] [INFO] [timer.py:199:stop] epoch=3/micro_step=40/global_step=2770, RunningAvgSamplesPerSec=23.758213203693636, CurrSamplesPerSec=23.793343636511256, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 17:39:32,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=49, lr=[8.853355077073033e-06, 8.853355077073033e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 17:39:33,098] [INFO] [timer.py:199:stop] epoch=3/micro_step=80/global_step=2780, RunningAvgSamplesPerSec=23.75849348384534, CurrSamplesPerSec=23.84758325474028, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 17:39:59,783] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=49, lr=[8.84767792570024e-06, 8.84767792570024e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 17:40:00,027] [INFO] [timer.py:199:stop]
epoch=3/micro_step=120/global_step=2790, RunningAvgSamplesPerSec=23.758691421253303, CurrSamplesPerSec=23.828444368774765, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:40:26,736] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=49, lr=[8.84198245121601e-06, 8.84198245121601e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:40:26,979] [INFO] [timer.py:199:stop] epoch=3/micro_step=160/global_step=2800, RunningAvgSamplesPerSec=23.758836992805243, CurrSamplesPerSec=23.725722330282952, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:40:53,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=49, lr=[8.836268679562968e-06, 8.836268679562968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:40:53,876] [INFO] [timer.py:199:stop] epoch=3/micro_step=200/global_step=2810, RunningAvgSamplesPerSec=23.759125958303724, CurrSamplesPerSec=23.818775482679097, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:41:20,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=49, lr=[8.830536636767077e-06, 8.830536636767077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:41:20,834] [INFO] [timer.py:199:stop] epoch=3/micro_step=240/global_step=2820, RunningAvgSamplesPerSec=23.75921547321205, CurrSamplesPerSec=23.74616893698357, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:41:47,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=49, lr=[8.824786348937526e-06, 8.824786348937526e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:41:47,821] [INFO] [timer.py:199:stop] epoch=3/micro_step=280/global_step=2830, RunningAvgSamplesPerSec=23.759253296253608, CurrSamplesPerSec=23.866500393780807, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:42:14,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=49, lr=[8.81901784226661e-06, 8.81901784226661e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:42:14,725] [INFO] [timer.py:199:stop] 
epoch=3/micro_step=320/global_step=2840, RunningAvgSamplesPerSec=23.759537441179326, CurrSamplesPerSec=23.753746185102724, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:42:41,385] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:42:41,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=50, lr=[8.813810630868224e-06, 8.813810630868224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:42:41,387] [INFO] [timer.py:199:stop] epoch=3/micro_step=360/global_step=2850, RunningAvgSamplesPerSec=23.760572009173664, CurrSamplesPerSec=26.674458872652146, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:42:43,792] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:43:07,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=51, lr=[8.80858870270232e-06, 8.80858870270232e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:43:08,036] [INFO] [timer.py:199:stop] epoch=3/micro_step=400/global_step=2860, RunningAvgSamplesPerSec=23.76163081489728, CurrSamplesPerSec=23.917220993552647, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:43:36,449] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=51, lr=[8.802769323324506e-06, 8.802769323324506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:43:36,691] [INFO] [timer.py:199:stop] epoch=3/micro_step=440/global_step=2870, RunningAvgSamplesPerSec=23.757006489555387, CurrSamplesPerSec=23.84962575959554, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:44:03,398] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=51, lr=[8.796931825391859e-06, 8.796931825391859e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:44:03,641] [INFO] [timer.py:199:stop] epoch=3/micro_step=480/global_step=2880, 
RunningAvgSamplesPerSec=23.75717495400155, CurrSamplesPerSec=23.837579189096548, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:44:30,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=51, lr=[8.791076235493908e-06, 8.791076235493908e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:44:30,750] [INFO] [timer.py:199:stop] epoch=3/micro_step=520/global_step=2890, RunningAvgSamplesPerSec=23.7572055505345, CurrSamplesPerSec=23.91901755365185, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:44:57,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=51, lr=[8.785202580302596e-06, 8.785202580302596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:44:57,738] [INFO] [timer.py:199:stop] epoch=3/micro_step=560/global_step=2900, RunningAvgSamplesPerSec=23.757244920918655, CurrSamplesPerSec=23.808191632401826, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:45:24,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=51, lr=[8.77931088657215e-06, 8.77931088657215e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:45:24,665] [INFO] [timer.py:199:stop] epoch=3/micro_step=600/global_step=2910, RunningAvgSamplesPerSec=23.75746790144199, CurrSamplesPerSec=23.746311779914564, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:45:51,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=51, lr=[8.773401181138962e-06, 8.773401181138962e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:45:51,553] [INFO] [timer.py:199:stop] epoch=3/micro_step=640/global_step=2920, RunningAvgSamplesPerSec=23.757776977973876, CurrSamplesPerSec=23.82234780648585, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:46:18,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=51, lr=[8.767473490921465e-06, 8.767473490921465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:46:18,501] [INFO] [timer.py:199:stop] epoch=3/micro_step=680/global_step=2930, 
RunningAvgSamplesPerSec=23.757918768414914, CurrSamplesPerSec=23.808567504249766, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:46:45,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=51, lr=[8.761527842920014e-06, 8.761527842920014e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:46:45,415] [INFO] [timer.py:199:stop] epoch=3/micro_step=720/global_step=2940, RunningAvgSamplesPerSec=23.75814174922708, CurrSamplesPerSec=23.78686663426976, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:47:12,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=51, lr=[8.75556426421676e-06, 8.75556426421676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:47:12,347] [INFO] [timer.py:199:stop] epoch=3/micro_step=760/global_step=2950, RunningAvgSamplesPerSec=23.758342647325204, CurrSamplesPerSec=23.82604808782165, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:47:17,444] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:47:19,843] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:47:38,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=53, lr=[8.750780509400208e-06, 8.750780509400208e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:47:38,743] [INFO] [timer.py:199:stop] epoch=3/micro_step=800/global_step=2960, RunningAvgSamplesPerSec=23.760243878847188, CurrSamplesPerSec=23.813336616845053, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:48:05,436] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=53, lr=[8.744784723941351e-06, 8.744784723941351e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:48:05,679] [INFO] [timer.py:199:stop] epoch=3/micro_step=840/global_step=2970, RunningAvgSamplesPerSec=23.760411937413267, CurrSamplesPerSec=23.77831404789422, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:48:32,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=53, lr=[8.73877108404482e-06, 8.73877108404482e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:48:32,588] [INFO] [timer.py:199:stop] epoch=3/micro_step=880/global_step=2980, RunningAvgSamplesPerSec=23.76062847079754, CurrSamplesPerSec=23.804197143660815, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:48:59,254] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=53, lr=[8.73273961710247e-06, 8.73273961710247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:48:59,497] [INFO] [timer.py:199:stop] epoch=3/micro_step=920/global_step=2990, RunningAvgSamplesPerSec=23.760858186230273, CurrSamplesPerSec=23.77217784518439, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:49:26,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=53, lr=[8.72669035058735e-06, 8.72669035058735e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:49:26,432] [INFO] [timer.py:199:stop] epoch=3/micro_step=960/global_step=3000, RunningAvgSamplesPerSec=23.761010175265383, CurrSamplesPerSec=23.780222517693506, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 17:49:53,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=53, lr=[8.720623312053589e-06, 8.720623312053589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:49:53,339] [INFO] [timer.py:199:stop] epoch=3/micro_step=1000/global_step=3010, RunningAvgSamplesPerSec=23.76122871577392, CurrSamplesPerSec=23.827640619767035, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:50:20,016] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=53, lr=[8.714538529136264e-06, 8.714538529136264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:50:20,258] [INFO] [timer.py:199:stop] epoch=3/micro_step=1040/global_step=3020, RunningAvgSamplesPerSec=23.76143093892886, CurrSamplesPerSec=23.78425743288347, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:50:46,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=53, lr=[8.708436029551283e-06, 8.708436029551283e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:50:47,196] [INFO] [timer.py:199:stop] epoch=3/micro_step=1080/global_step=3030, RunningAvgSamplesPerSec=23.761564774775188, CurrSamplesPerSec=23.78381278763958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:51:13,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=53, lr=[8.702315841095247e-06, 8.702315841095247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:51:14,089] [INFO] [timer.py:199:stop] epoch=3/micro_step=1120/global_step=3040, RunningAvgSamplesPerSec=23.76181948531435, CurrSamplesPerSec=23.805704420092837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:51:40,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=53, lr=[8.696177991645331e-06, 8.696177991645331e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:51:41,031] [INFO] [timer.py:199:stop] epoch=3/micro_step=1160/global_step=3050, RunningAvgSamplesPerSec=23.761935928317072, CurrSamplesPerSec=23.851511785302193, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 17:51:51,521] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:51:53,924] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:52:07,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=55, lr=[8.691255014954624e-06, 8.691255014954624e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:52:07,371] [INFO] [timer.py:199:stop] epoch=3/micro_step=1200/global_step=3060, RunningAvgSamplesPerSec=23.763786442431474, CurrSamplesPerSec=23.84317102844024, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:52:34,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=55, lr=[8.685085446222903e-06, 8.685085446222903e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:52:34,294] [INFO] [timer.py:199:stop] epoch=3/micro_step=1240/global_step=3070, RunningAvgSamplesPerSec=23.763957783019404, CurrSamplesPerSec=23.826773477700012, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:53:00,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=55, lr=[8.67889829498095e-06, 8.67889829498095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:53:01,231] [INFO] [timer.py:199:stop] epoch=3/micro_step=1280/global_step=3080, RunningAvgSamplesPerSec=23.76407554025958, CurrSamplesPerSec=23.80325360976457, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:53:27,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=55, lr=[8.672693589410954e-06, 8.672693589410954e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:53:28,153] [INFO] [timer.py:199:stop] epoch=3/micro_step=1320/global_step=3090, RunningAvgSamplesPerSec=23.764240866449427, CurrSamplesPerSec=23.84787138775568, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:53:54,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, 
skipped=55, lr=[8.666471357775062e-06, 8.666471357775062e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:53:55,072] [INFO] [timer.py:199:stop] epoch=3/micro_step=1360/global_step=3100, RunningAvgSamplesPerSec=23.764405911966733, CurrSamplesPerSec=23.838516978304042, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:54:21,739] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=55, lr=[8.660231628415247e-06, 8.660231628415247e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:54:21,980] [INFO] [timer.py:199:stop] epoch=3/micro_step=1400/global_step=3110, RunningAvgSamplesPerSec=23.76461291348838, CurrSamplesPerSec=23.811253855290342, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:54:48,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=55, lr=[8.653974429753188e-06, 8.653974429753188e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:54:48,920] [INFO] [timer.py:199:stop] epoch=3/micro_step=1440/global_step=3120, RunningAvgSamplesPerSec=23.764742912013464, CurrSamplesPerSec=23.712816011704316, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:55:15,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=55, lr=[8.647699790290138e-06, 8.647699790290138e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:55:15,857] [INFO] [timer.py:199:stop] epoch=3/micro_step=1480/global_step=3130, RunningAvgSamplesPerSec=23.764870919827054, CurrSamplesPerSec=23.8129986186027, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:55:42,525] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=55, lr=[8.641407738606786e-06, 8.641407738606786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:55:42,767] [INFO] [timer.py:199:stop] epoch=3/micro_step=1520/global_step=3140, RunningAvgSamplesPerSec=23.765068895433355, CurrSamplesPerSec=23.826587367629678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:56:09,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=55, 
lr=[8.635098303363138e-06, 8.635098303363138e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:56:09,666] [INFO] [timer.py:199:stop] epoch=3/micro_step=1560/global_step=3150, RunningAvgSamplesPerSec=23.765290987510035, CurrSamplesPerSec=23.854226914553163, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:56:25,535] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 17:56:27,940] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 17:56:35,787] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=57, lr=[8.630038258314879e-06, 8.630038258314879e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:56:36,027] [INFO] [timer.py:199:stop] epoch=3/micro_step=1600/global_step=3160, RunningAvgSamplesPerSec=23.767022749413314, CurrSamplesPerSec=23.762731255460768, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:57:02,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=57, lr=[8.623697605138568e-06, 8.623697605138568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:57:02,947] [INFO] [timer.py:199:stop] epoch=3/micro_step=1640/global_step=3170, RunningAvgSamplesPerSec=23.767190023618028, CurrSamplesPerSec=23.817063683597855, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:57:29,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=57, lr=[8.617339649070793e-06, 8.617339649070793e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:57:29,853] [INFO] [timer.py:199:stop] epoch=3/micro_step=1680/global_step=3180, RunningAvgSamplesPerSec=23.767394731377678, CurrSamplesPerSec=23.831147901480275, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:57:56,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=57, lr=[8.610964419071754e-06, 8.610964419071754e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 17:57:56,784] [INFO] [timer.py:199:stop] epoch=3/micro_step=1720/global_step=3190, RunningAvgSamplesPerSec=23.767519327045985, CurrSamplesPerSec=23.780530092261408, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:58:23,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=57, lr=[8.604571944180328e-06, 8.604571944180328e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:58:23,707] [INFO] [timer.py:199:stop] epoch=3/micro_step=1760/global_step=3200, RunningAvgSamplesPerSec=23.767673608911885, CurrSamplesPerSec=23.7815244973605, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:58:50,377] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=57, lr=[8.598162253513937e-06, 8.598162253513937e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:58:50,619] [INFO] [timer.py:199:stop] epoch=3/micro_step=1800/global_step=3210, RunningAvgSamplesPerSec=23.76784687467399, CurrSamplesPerSec=23.766379371099298, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:59:17,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=57, lr=[8.591735376268429e-06, 8.591735376268429e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:59:17,560] [INFO] [timer.py:199:stop] epoch=3/micro_step=1840/global_step=3220, RunningAvgSamplesPerSec=23.767937743468362, CurrSamplesPerSec=23.83868633862253, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 17:59:44,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=57, lr=[8.585291341717932e-06, 8.585291341717932e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 17:59:44,461] [INFO] [timer.py:199:stop] epoch=3/micro_step=1880/global_step=3230, RunningAvgSamplesPerSec=23.768130964934414, CurrSamplesPerSec=23.850990449490915, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:00:11,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=57, lr=[8.578830179214721e-06, 8.578830179214721e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
18:00:11,362] [INFO] [timer.py:199:stop] epoch=3/micro_step=1920/global_step=3240, RunningAvgSamplesPerSec=23.768347182688036, CurrSamplesPerSec=23.84627191375257, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:00:38,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=57, lr=[8.572351918189096e-06, 8.572351918189096e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:00:38,263] [INFO] [timer.py:199:stop] epoch=3/micro_step=1960/global_step=3250, RunningAvgSamplesPerSec=23.76855547993185, CurrSamplesPerSec=23.81715455069824, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:00:59,534] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:01:01,931] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:01:04,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=59, lr=[8.567157018259339e-06, 8.567157018259339e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:01:04,626] [INFO] [timer.py:199:stop] epoch=3/micro_step=2000/global_step=3260, RunningAvgSamplesPerSec=23.770243806168313, CurrSamplesPerSec=23.79676066386675, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:01:31,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=59, lr=[8.560648054306227e-06, 8.560648054306227e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:01:31,527] [INFO] [timer.py:199:stop] epoch=3/micro_step=2040/global_step=3270, RunningAvgSamplesPerSec=23.770453507427785, CurrSamplesPerSec=23.839395561094058, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:01:58,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=59, lr=[8.554122074649432e-06, 8.554122074649432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:01:58,444] [INFO] [timer.py:199:stop] 
epoch=3/micro_step=2080/global_step=3280, RunningAvgSamplesPerSec=23.77062742660437, CurrSamplesPerSec=23.84783537074799, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:02:25,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=59, lr=[8.547579109014494e-06, 8.547579109014494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:02:25,361] [INFO] [timer.py:199:stop] epoch=3/micro_step=2120/global_step=3290, RunningAvgSamplesPerSec=23.77078091667253, CurrSamplesPerSec=23.89859573341771, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:02:52,024] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=59, lr=[8.541019187204314e-06, 8.541019187204314e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:02:52,266] [INFO] [timer.py:199:stop] epoch=3/micro_step=2160/global_step=3300, RunningAvgSamplesPerSec=23.77096841792543, CurrSamplesPerSec=23.865377928668124, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:03:18,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=59, lr=[8.534442339099036e-06, 8.534442339099036e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:03:19,187] [INFO] [timer.py:199:stop] epoch=3/micro_step=2200/global_step=3310, RunningAvgSamplesPerSec=23.77111658789739, CurrSamplesPerSec=23.821208351336647, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:03:45,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=59, lr=[8.527848594655894e-06, 8.527848594655894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:03:46,101] [INFO] [timer.py:199:stop] epoch=3/micro_step=2240/global_step=3320, RunningAvgSamplesPerSec=23.77130340346255, CurrSamplesPerSec=23.829032408973625, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:04:12,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=59, lr=[8.521237983909091e-06, 8.521237983909091e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:04:13,021] [INFO] [timer.py:199:stop] 
epoch=3/micro_step=2280/global_step=3330, RunningAvgSamplesPerSec=23.771465003838227, CurrSamplesPerSec=23.80225738670305, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:04:39,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=59, lr=[8.51461053696965e-06, 8.51461053696965e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:04:39,973] [INFO] [timer.py:199:stop] epoch=3/micro_step=2320/global_step=3340, RunningAvgSamplesPerSec=23.771533919880454, CurrSamplesPerSec=23.748822314857527, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:05:06,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=59, lr=[8.507966284025285e-06, 8.507966284025285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:05:06,887] [INFO] [timer.py:199:stop] epoch=3/micro_step=2360/global_step=3350, RunningAvgSamplesPerSec=23.771694778053856, CurrSamplesPerSec=23.83204286786256, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:05:33,517] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:05:33,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=60, lr=[8.501972112252983e-06, 8.501972112252983e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:05:33,518] [INFO] [timer.py:199:stop] epoch=3/micro_step=2400/global_step=3360, RunningAvgSamplesPerSec=23.772600188089505, CurrSamplesPerSec=26.628781811809574, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:05:35,919] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:05:59,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=61, lr=[8.495964374245339e-06, 8.495964374245339e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:06:00,159] [INFO] [timer.py:199:stop] epoch=3/micro_step=2440/global_step=3370, RunningAvgSamplesPerSec=23.77347636977231, CurrSamplesPerSec=23.811395370146883, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:06:26,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=61, lr=[8.489273225736897e-06, 8.489273225736897e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:06:27,164] [INFO] [timer.py:199:stop] epoch=3/micro_step=2480/global_step=3380, RunningAvgSamplesPerSec=23.773402714198284, CurrSamplesPerSec=23.621027963286295, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:06:53,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=61, lr=[8.48256538663381e-06, 8.48256538663381e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:06:54,184] [INFO] [timer.py:199:stop] epoch=3/micro_step=2520/global_step=3390, RunningAvgSamplesPerSec=23.77327202108748, CurrSamplesPerSec=23.595029173456812, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:07:20,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=61, lr=[8.475840887489972e-06, 8.475840887489972e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:07:21,113] [INFO] [timer.py:199:stop] epoch=3/micro_step=2560/global_step=3400, RunningAvgSamplesPerSec=23.77338703315664, CurrSamplesPerSec=23.834285868525992, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:07:47,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=61, lr=[8.469099758935167e-06, 8.469099758935167e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:07:48,044] [INFO] [timer.py:199:stop] epoch=3/micro_step=2600/global_step=3410, RunningAvgSamplesPerSec=23.77348103532029, CurrSamplesPerSec=23.83629011397907, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 18:08:14,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=61, lr=[8.462342031674923e-06, 8.462342031674923e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:08:14,973] [INFO] [timer.py:199:stop] epoch=3/micro_step=2640/global_step=3420, RunningAvgSamplesPerSec=23.7735773624109, CurrSamplesPerSec=23.816218441468106, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:08:41,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=61, lr=[8.455567736490373e-06, 8.455567736490373e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:08:41,886] [INFO] [timer.py:199:stop] epoch=3/micro_step=2680/global_step=3430, RunningAvgSamplesPerSec=23.773729343943725, CurrSamplesPerSec=23.83950988747177, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:09:08,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=61, lr=[8.448776904238119e-06, 8.448776904238119e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:09:08,786] [INFO] [timer.py:199:stop] epoch=3/micro_step=2720/global_step=3440, RunningAvgSamplesPerSec=23.773916245204813, CurrSamplesPerSec=23.874405197600424, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:09:35,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=61, lr=[8.441969565850084e-06, 8.441969565850084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:09:35,658] [INFO] [timer.py:199:stop] epoch=3/micro_step=2760/global_step=3450, RunningAvgSamplesPerSec=23.774161289274563, CurrSamplesPerSec=23.89563650671286, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:10:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=61, lr=[8.435145752333386e-06, 8.435145752333386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:10:02,569] [INFO] [timer.py:199:stop] epoch=3/micro_step=2800/global_step=3460, RunningAvgSamplesPerSec=23.77430671984523, CurrSamplesPerSec=23.783108975599927, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 18:10:07,670] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:10:10,072] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:10:28,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=63, lr=[8.429674860312075e-06, 8.429674860312075e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:10:28,904] [INFO] [timer.py:199:stop] epoch=3/micro_step=2840/global_step=3470, RunningAvgSamplesPerSec=23.775928225238047, CurrSamplesPerSec=23.865689831820678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:10:55,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=63, lr=[8.422821469941139e-06, 8.422821469941139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:10:55,775] [INFO] [timer.py:199:stop] epoch=3/micro_step=2880/global_step=3480, RunningAvgSamplesPerSec=23.77617291038019, CurrSamplesPerSec=23.891098038999292, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:11:22,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=63, lr=[8.415951691660225e-06, 8.415951691660225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:11:22,656] [INFO] [timer.py:199:stop] epoch=3/micro_step=2920/global_step=3490, RunningAvgSamplesPerSec=23.77640215607889, CurrSamplesPerSec=23.879763657149322, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:11:49,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=63, lr=[8.409065556760853e-06, 8.409065556760853e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:11:49,522] [INFO] [timer.py:199:stop] epoch=3/micro_step=2960/global_step=3500, RunningAvgSamplesPerSec=23.776656522958802, CurrSamplesPerSec=23.902213325077295, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:12:16,197] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=3510, skipped=63, lr=[8.402163096609049e-06, 8.402163096609049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:12:16,438] [INFO] [timer.py:199:stop] epoch=3/micro_step=3000/global_step=3510, RunningAvgSamplesPerSec=23.77679538868489, CurrSamplesPerSec=23.81618463303078, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:12:43,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=63, lr=[8.3952443426452e-06, 8.3952443426452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:12:43,338] [INFO] [timer.py:199:stop] epoch=3/micro_step=3040/global_step=3520, RunningAvgSamplesPerSec=23.776968723149892, CurrSamplesPerSec=23.9195376048068, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:13:09,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=63, lr=[8.388309326383907e-06, 8.388309326383907e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:13:10,240] [INFO] [timer.py:199:stop] epoch=3/micro_step=3080/global_step=3530, RunningAvgSamplesPerSec=23.777127470822627, CurrSamplesPerSec=23.818965697766245, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:13:36,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=63, lr=[8.38135807941385e-06, 8.38135807941385e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:13:37,186] [INFO] [timer.py:199:stop] epoch=3/micro_step=3120/global_step=3540, RunningAvgSamplesPerSec=23.777179198193366, CurrSamplesPerSec=23.766339391444273, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:14:03,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=63, lr=[8.374390633397635e-06, 8.374390633397635e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:14:04,156] [INFO] [timer.py:199:stop] epoch=3/micro_step=3160/global_step=3550, RunningAvgSamplesPerSec=23.777191084009985, CurrSamplesPerSec=23.79902234755743, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:14:30,841] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=63, 
lr=[8.367407020071657e-06, 8.367407020071657e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:14:31,081] [INFO] [timer.py:199:stop] epoch=3/micro_step=3200/global_step=3560, RunningAvgSamplesPerSec=23.777292984977752, CurrSamplesPerSec=23.87024624455643, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:14:41,543] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:14:43,942] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:14:57,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=65, lr=[8.361808510321717e-06, 8.361808510321717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:14:57,386] [INFO] [timer.py:199:stop] epoch=3/micro_step=3240/global_step=3570, RunningAvgSamplesPerSec=23.778924278255197, CurrSamplesPerSec=23.885339177290785, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:15:25,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=65, lr=[8.354795876048845e-06, 8.354795876048845e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:15:25,505] [INFO] [timer.py:199:stop] epoch=3/micro_step=3280/global_step=3580, RunningAvgSamplesPerSec=23.777333788318487, CurrSamplesPerSec=18.91806592413434, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:15:52,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=65, lr=[8.34776716371942e-06, 8.34776716371942e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:15:52,518] [INFO] [timer.py:199:stop] epoch=3/micro_step=3320/global_step=3590, RunningAvgSamplesPerSec=23.77742660137374, CurrSamplesPerSec=23.830028759152132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:16:19,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=65, lr=[8.340722405348902e-06, 8.340722405348902e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 18:16:19,434] [INFO] [timer.py:199:stop] epoch=3/micro_step=3360/global_step=3600, RunningAvgSamplesPerSec=23.777547530988407, CurrSamplesPerSec=23.837676563323917, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:16:46,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=65, lr=[8.333661633025837e-06, 8.333661633025837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:16:46,380] [INFO] [timer.py:199:stop] epoch=3/micro_step=3400/global_step=3610, RunningAvgSamplesPerSec=23.777603469222658, CurrSamplesPerSec=23.81767018179337, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:17:13,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=65, lr=[8.326584878911717e-06, 8.326584878911717e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:17:13,317] [INFO] [timer.py:199:stop] epoch=3/micro_step=3440/global_step=3620, RunningAvgSamplesPerSec=23.77766520692515, CurrSamplesPerSec=23.72955627082526, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:17:40,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=65, lr=[8.319492175240829e-06, 8.319492175240829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:17:40,260] [INFO] [timer.py:199:stop] epoch=3/micro_step=3480/global_step=3630, RunningAvgSamplesPerSec=23.777711597609166, CurrSamplesPerSec=23.777298849938383, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:18:06,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=65, lr=[8.312383554320108e-06, 8.312383554320108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:18:07,211] [INFO] [timer.py:199:stop] epoch=3/micro_step=3520/global_step=3640, RunningAvgSamplesPerSec=23.77774431228413, CurrSamplesPerSec=23.853718178175647, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:18:33,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=65, lr=[8.305259048528994e-06, 8.305259048528994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
18:18:34,131] [INFO] [timer.py:199:stop] epoch=3/micro_step=3560/global_step=3650, RunningAvgSamplesPerSec=23.77785124672129, CurrSamplesPerSec=23.748755080226395, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:19:00,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=65, lr=[8.298118690319277e-06, 8.298118690319277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:19:01,048] [INFO] [timer.py:199:stop] epoch=3/micro_step=3600/global_step=3660, RunningAvgSamplesPerSec=23.77795934613172, CurrSamplesPerSec=23.790256500245494, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:19:16,904] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:19:19,301] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:19:27,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=67, lr=[8.29239501186382e-06, 8.29239501186382e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:19:27,382] [INFO] [timer.py:199:stop] epoch=3/micro_step=3640/global_step=3670, RunningAvgSamplesPerSec=23.779469828421615, CurrSamplesPerSec=23.793592497959995, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:19:54,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=67, lr=[8.285226201309532e-06, 8.285226201309532e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:19:54,301] [INFO] [timer.py:199:stop] epoch=3/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=23.779569839047998, CurrSamplesPerSec=23.824030763375905, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 4/16 *****
ppl: 1.8753175735473633
saving the final model ...
Beginning of Epoch 5/16, Total Micro Batches 3680
[2023-04-23 18:21:12,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=3690, skipped=67, lr=[8.278041629585339e-06, 8.278041629585339e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:21:12,845] [INFO] [timer.py:199:stop] epoch=4/micro_step=40/global_step=3690, RunningAvgSamplesPerSec=23.779388067080035, CurrSamplesPerSec=23.794324349503093, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:21:40,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=3700, skipped=67, lr=[8.270841329416625e-06, 8.270841329416625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:21:40,792] [INFO] [timer.py:199:stop] epoch=4/micro_step=80/global_step=3700, RunningAvgSamplesPerSec=23.778225734151512, CurrSamplesPerSec=23.763421239886608, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:22:07,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=3710, skipped=67, lr=[8.263625333600423e-06, 8.263625333600423e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:22:07,728] [INFO] [timer.py:199:stop] epoch=4/micro_step=120/global_step=3710, RunningAvgSamplesPerSec=23.778291911212815, CurrSamplesPerSec=23.744100007836995, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:22:34,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=3720, skipped=67, lr=[8.256393675005265e-06, 8.256393675005265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:22:34,678] [INFO] [timer.py:199:stop] epoch=4/micro_step=160/global_step=3720, RunningAvgSamplesPerSec=23.778329181269218, CurrSamplesPerSec=23.780968294429034, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:23:01,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=3730, skipped=67, lr=[8.249146386571017e-06, 8.249146386571017e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:23:01,606] [INFO] [timer.py:199:stop] epoch=4/micro_step=200/global_step=3730, RunningAvgSamplesPerSec=23.77841787054744, CurrSamplesPerSec=23.789883314691227, MemAllocated=11.44GB,
MaxMemAllocated=31.66GB [2023-04-23 18:23:28,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=3740, skipped=67, lr=[8.241883501308742e-06, 8.241883501308742e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:23:28,526] [INFO] [timer.py:199:stop] epoch=4/micro_step=240/global_step=3740, RunningAvgSamplesPerSec=23.77852136794432, CurrSamplesPerSec=23.84247852284268, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:23:55,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=3750, skipped=67, lr=[8.234605052300547e-06, 8.234605052300547e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:23:55,455] [INFO] [timer.py:199:stop] epoch=4/micro_step=280/global_step=3750, RunningAvgSamplesPerSec=23.778605180782154, CurrSamplesPerSec=23.836882774878923, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:24:22,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=3760, skipped=67, lr=[8.227311072699427e-06, 8.227311072699427e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:24:22,378] [INFO] [timer.py:199:stop] epoch=4/micro_step=320/global_step=3760, RunningAvgSamplesPerSec=23.778696429973493, CurrSamplesPerSec=23.798507523555106, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:24:43,652] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:24:46,057] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:24:48,512] [INFO] [logging.py:96:log_dist] [Rank 0] step=3770, skipped=69, lr=[8.221464729315606e-06, 8.221464729315606e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:24:48,753] [INFO] [timer.py:199:stop] epoch=4/micro_step=360/global_step=3770, RunningAvgSamplesPerSec=23.78008487465785, CurrSamplesPerSec=23.78319326227138, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:25:15,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=3780, skipped=69, lr=[8.214142878418489e-06, 8.214142878418489e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:25:15,659] [INFO] [timer.py:199:stop] epoch=4/micro_step=400/global_step=3780, RunningAvgSamplesPerSec=23.780204620679395, CurrSamplesPerSec=23.840742141365453, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:25:42,325] [INFO] [logging.py:96:log_dist] [Rank 0] step=3790, skipped=69, lr=[8.2068055901327e-06, 8.2068055901327e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:25:42,568] [INFO] [timer.py:199:stop] epoch=4/micro_step=440/global_step=3790, RunningAvgSamplesPerSec=23.780326947935116, CurrSamplesPerSec=23.83284479791236, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:26:09,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=3800, skipped=69, lr=[8.19945289787925e-06, 8.19945289787925e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:26:09,489] [INFO] [timer.py:199:stop] epoch=4/micro_step=480/global_step=3800, RunningAvgSamplesPerSec=23.780412672750426, CurrSamplesPerSec=23.742617322037916, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:26:36,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=3810, skipped=69, lr=[8.19208483514931e-06, 8.19208483514931e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:26:36,429] [INFO] [timer.py:199:stop] epoch=4/micro_step=520/global_step=3810, RunningAvgSamplesPerSec=23.78045313425241, CurrSamplesPerSec=23.74489813068124, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 18:27:03,121] [INFO] [logging.py:96:log_dist] [Rank 0] step=3820, skipped=69, lr=[8.184701435504064e-06, 8.184701435504064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:27:03,363] [INFO] [timer.py:199:stop] epoch=4/micro_step=560/global_step=3820, RunningAvgSamplesPerSec=23.78050333695069, CurrSamplesPerSec=23.78435647934884, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:27:30,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=3830, skipped=69, lr=[8.177302732574555e-06, 8.177302732574555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:27:30,290] [INFO] [timer.py:199:stop] epoch=4/micro_step=600/global_step=3830, RunningAvgSamplesPerSec=23.780578373156782, CurrSamplesPerSec=23.86746804769093, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:27:57,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=3840, skipped=69, lr=[8.169888760061535e-06, 8.169888760061535e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:27:57,255] [INFO] [timer.py:199:stop] epoch=4/micro_step=640/global_step=3840, RunningAvgSamplesPerSec=23.780564336927913, CurrSamplesPerSec=23.751890299873132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:28:24,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=3850, skipped=69, lr=[8.1624595517353e-06, 8.1624595517353e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:28:24,249] [INFO] [timer.py:199:stop] epoch=4/micro_step=680/global_step=3850, RunningAvgSamplesPerSec=23.780484980009323, CurrSamplesPerSec=23.727890829347313, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:28:50,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=3860, skipped=69, lr=[8.15501514143555e-06, 8.15501514143555e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:28:51,214] [INFO] [timer.py:199:stop] epoch=4/micro_step=720/global_step=3860, RunningAvgSamplesPerSec=23.780469506975635, CurrSamplesPerSec=23.836736723563586, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
18:29:17,845] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:29:17,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=3870, skipped=70, lr=[8.148302202502738e-06, 8.148302202502738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:29:17,847] [INFO] [timer.py:199:stop] epoch=4/micro_step=760/global_step=3870, RunningAvgSamplesPerSec=23.781214148611856, CurrSamplesPerSec=26.688533261112806, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:29:20,247] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:29:44,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=3880, skipped=71, lr=[8.14157700220421e-06, 8.14157700220421e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:29:44,501] [INFO] [timer.py:199:stop] epoch=4/micro_step=800/global_step=3880, RunningAvgSamplesPerSec=23.781908906773047, CurrSamplesPerSec=23.760537455389596, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:30:11,210] [INFO] [logging.py:96:log_dist] [Rank 0] step=3890, skipped=71, lr=[8.134090206994642e-06, 8.134090206994642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:30:11,452] [INFO] [timer.py:199:stop] epoch=4/micro_step=840/global_step=3890, RunningAvgSamplesPerSec=23.78193263242783, CurrSamplesPerSec=23.81534579191668, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:30:38,181] [INFO] [logging.py:96:log_dist] [Rank 0] step=3900, skipped=71, lr=[8.12658833903261e-06, 8.12658833903261e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:30:38,422] [INFO] [timer.py:199:stop] epoch=4/micro_step=880/global_step=3900, RunningAvgSamplesPerSec=23.78190514041435, CurrSamplesPerSec=23.67282828504635, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:31:05,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=3910, skipped=71, 
lr=[8.119071432488774e-06, 8.119071432488774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:31:05,374] [INFO] [timer.py:199:stop] epoch=4/micro_step=920/global_step=3910, RunningAvgSamplesPerSec=23.78192141985094, CurrSamplesPerSec=23.7410466331435, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:31:32,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=3920, skipped=71, lr=[8.111539521602298e-06, 8.111539521602298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:31:32,322] [INFO] [timer.py:199:stop] epoch=4/micro_step=960/global_step=3920, RunningAvgSamplesPerSec=23.781946568337325, CurrSamplesPerSec=23.70974554695018, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:31:58,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=3930, skipped=71, lr=[8.103992640680684e-06, 8.103992640680684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:31:59,235] [INFO] [timer.py:199:stop] epoch=4/micro_step=1000/global_step=3930, RunningAvgSamplesPerSec=23.782039950365156, CurrSamplesPerSec=23.83583928810225, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:32:28,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=3940, skipped=71, lr=[8.096430824099625e-06, 8.096430824099625e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:32:28,596] [INFO] [timer.py:199:stop] epoch=4/micro_step=1040/global_step=3940, RunningAvgSamplesPerSec=23.777283822513144, CurrSamplesPerSec=13.300911408306144, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:32:55,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=3950, skipped=71, lr=[8.088854106302846e-06, 8.088854106302846e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:32:55,544] [INFO] [timer.py:199:stop] epoch=4/micro_step=1080/global_step=3950, RunningAvgSamplesPerSec=23.777361757665677, CurrSamplesPerSec=23.748412609732323, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:33:22,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=3960, skipped=71, lr=[8.081262521801946e-06, 
8.081262521801946e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:33:22,454] [INFO] [timer.py:199:stop] epoch=4/micro_step=1120/global_step=3960, RunningAvgSamplesPerSec=23.777484754870166, CurrSamplesPerSec=23.75517139915319, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:33:49,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=3970, skipped=71, lr=[8.073656105176237e-06, 8.073656105176237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:33:49,569] [INFO] [timer.py:199:stop] epoch=4/micro_step=1160/global_step=3970, RunningAvgSamplesPerSec=23.77720000343105, CurrSamplesPerSec=23.839234659160844, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:33:54,660] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:33:57,070] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:34:15,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=3980, skipped=73, lr=[8.067560316026245e-06, 8.067560316026245e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:34:15,941] [INFO] [timer.py:199:stop] epoch=4/micro_step=1200/global_step=3980, RunningAvgSamplesPerSec=23.778505989964376, CurrSamplesPerSec=23.650940604244006, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:34:42,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=3990, skipped=73, lr=[8.059927288931308e-06, 8.059927288931308e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:34:42,885] [INFO] [timer.py:199:stop] epoch=4/micro_step=1240/global_step=3990, RunningAvgSamplesPerSec=23.778546900921864, CurrSamplesPerSec=23.794917034608876, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:35:09,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=73, lr=[8.052279526892564e-06, 8.052279526892564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
18:35:09,851] [INFO] [timer.py:199:stop] epoch=4/micro_step=1280/global_step=4000, RunningAvgSamplesPerSec=23.778542221698547, CurrSamplesPerSec=23.735322097681294, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:35:36,600] [INFO] [logging.py:96:log_dist] [Rank 0] step=4010, skipped=73, lr=[8.044617064745209e-06, 8.044617064745209e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:35:36,843] [INFO] [timer.py:199:stop] epoch=4/micro_step=1320/global_step=4010, RunningAvgSamplesPerSec=23.77847456207823, CurrSamplesPerSec=23.709873292589066, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:36:03,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=4020, skipped=73, lr=[8.036939937391404e-06, 8.036939937391404e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:36:03,818] [INFO] [timer.py:199:stop] epoch=4/micro_step=1360/global_step=4020, RunningAvgSamplesPerSec=23.778437859502066, CurrSamplesPerSec=23.73706204759413, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:36:30,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=4030, skipped=73, lr=[8.029248179800113e-06, 8.029248179800113e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:36:30,754] [INFO] [timer.py:199:stop] epoch=4/micro_step=1400/global_step=4030, RunningAvgSamplesPerSec=23.7784881559593, CurrSamplesPerSec=23.824965372270942, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:36:57,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=4040, skipped=73, lr=[8.021541827006927e-06, 8.021541827006927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:36:57,674] [INFO] [timer.py:199:stop] epoch=4/micro_step=1440/global_step=4040, RunningAvgSamplesPerSec=23.778573963516596, CurrSamplesPerSec=23.752735185613194, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:37:24,365] [INFO] [logging.py:96:log_dist] [Rank 0] step=4050, skipped=73, lr=[8.013820914113932e-06, 8.013820914113932e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:37:24,607] [INFO] 
[timer.py:199:stop] epoch=4/micro_step=1480/global_step=4050, RunningAvgSamplesPerSec=23.778626876992394, CurrSamplesPerSec=23.799735542196128, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:37:51,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=4060, skipped=73, lr=[8.006085476289527e-06, 8.006085476289527e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:37:51,544] [INFO] [timer.py:199:stop] epoch=4/micro_step=1520/global_step=4060, RunningAvgSamplesPerSec=23.778671945309153, CurrSamplesPerSec=23.75230853181898, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:38:18,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=4070, skipped=73, lr=[7.99833554876827e-06, 7.99833554876827e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:38:18,474] [INFO] [timer.py:199:stop] epoch=4/micro_step=1560/global_step=4070, RunningAvgSamplesPerSec=23.778737974625155, CurrSamplesPerSec=23.823910242227168, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:38:28,950] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:38:31,352] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:38:44,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=4080, skipped=75, lr=[7.992125197889296e-06, 7.992125197889296e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:38:44,819] [INFO] [timer.py:199:stop] epoch=4/micro_step=1600/global_step=4080, RunningAvgSamplesPerSec=23.78007159319039, CurrSamplesPerSec=23.76164166653684, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:39:11,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=4090, skipped=75, lr=[7.984349277915372e-06, 7.984349277915372e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:39:11,730] [INFO] [timer.py:199:stop] epoch=4/micro_step=1640/global_step=4090, RunningAvgSamplesPerSec=23.780171566091482, CurrSamplesPerSec=23.75469210454613, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:39:38,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=4100, skipped=75, lr=[7.976558967251974e-06, 7.976558967251974e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:39:38,640] [INFO] [timer.py:199:stop] epoch=4/micro_step=1680/global_step=4100, RunningAvgSamplesPerSec=23.780275121804603, CurrSamplesPerSec=23.8432832731971, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:40:05,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=4110, skipped=75, lr=[7.968754301383604e-06, 7.968754301383604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:40:05,555] [INFO] [timer.py:199:stop] epoch=4/micro_step=1720/global_step=4110, RunningAvgSamplesPerSec=23.780367989080656, CurrSamplesPerSec=23.832135965152148, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:40:32,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=4120, skipped=75, lr=[7.960935315860156e-06, 7.960935315860156e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:40:32,465] [INFO] [timer.py:199:stop] epoch=4/micro_step=1760/global_step=4120, RunningAvgSamplesPerSec=23.780469841647115, CurrSamplesPerSec=23.840928472628917, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 18:40:59,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=75, lr=[7.953102046296745e-06, 7.953102046296745e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:40:59,385] [INFO] [timer.py:199:stop] epoch=4/micro_step=1800/global_step=4130, RunningAvgSamplesPerSec=23.78054663151962, CurrSamplesPerSec=23.798880979731866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:41:26,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=4140, skipped=75, lr=[7.94525452837355e-06, 7.94525452837355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:41:26,353] [INFO] [timer.py:199:stop] epoch=4/micro_step=1840/global_step=4140, RunningAvgSamplesPerSec=23.780529479800585, CurrSamplesPerSec=23.74883282030303, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:41:53,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=4150, skipped=75, lr=[7.937392797835654e-06, 7.937392797835654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:41:53,295] [INFO] [timer.py:199:stop] epoch=4/micro_step=1880/global_step=4150, RunningAvgSamplesPerSec=23.780557710511466, CurrSamplesPerSec=23.763762039660058, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:42:20,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=4160, skipped=75, lr=[7.929516890492869e-06, 7.929516890492869e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:42:20,253] [INFO] [timer.py:199:stop] epoch=4/micro_step=1920/global_step=4160, RunningAvgSamplesPerSec=23.780560567538767, CurrSamplesPerSec=23.67352349727817, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:42:47,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=4170, skipped=75, lr=[7.921626842219596e-06, 7.921626842219596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:42:47,268] [INFO] [timer.py:199:stop] epoch=4/micro_step=1960/global_step=4170, RunningAvgSamplesPerSec=23.78044757329895, CurrSamplesPerSec=23.706728218302004, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 18:43:03,203] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:43:05,612] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:43:13,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=4180, skipped=77, lr=[7.915304646279738e-06, 7.915304646279738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:43:13,702] [INFO] [timer.py:199:stop] epoch=4/micro_step=2000/global_step=4180, RunningAvgSamplesPerSec=23.781556851454695, CurrSamplesPerSec=23.822115255765496, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:43:40,403] [INFO] [logging.py:96:log_dist] [Rank 0] step=4190, skipped=77, lr=[7.907389234940544e-06, 7.907389234940544e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:43:40,642] [INFO] [timer.py:199:stop] epoch=4/micro_step=2040/global_step=4190, RunningAvgSamplesPerSec=23.781598879260496, CurrSamplesPerSec=23.782527413849117, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:44:07,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=4200, skipped=77, lr=[7.899459783461304e-06, 7.899459783461304e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:44:07,558] [INFO] [timer.py:199:stop] epoch=4/micro_step=2080/global_step=4200, RunningAvgSamplesPerSec=23.78168212484104, CurrSamplesPerSec=23.797125626642096, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:44:34,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=4210, skipped=77, lr=[7.891516327960301e-06, 7.891516327960301e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:44:34,535] [INFO] [timer.py:199:stop] epoch=4/micro_step=2120/global_step=4210, RunningAvgSamplesPerSec=23.781640083342992, CurrSamplesPerSec=23.753472933510654, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:45:01,277] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=4220, skipped=77, lr=[7.883558904619604e-06, 7.883558904619604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:45:01,518] [INFO] [timer.py:199:stop] epoch=4/micro_step=2160/global_step=4220, RunningAvgSamplesPerSec=23.78159165892467, CurrSamplesPerSec=23.782590625705176, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:45:28,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=4230, skipped=77, lr=[7.875587549684912e-06, 7.875587549684912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:45:28,505] [INFO] [timer.py:199:stop] epoch=4/micro_step=2200/global_step=4230, RunningAvgSamplesPerSec=23.78153524653221, CurrSamplesPerSec=23.801647452301026, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:45:55,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=4240, skipped=77, lr=[7.867602299465374e-06, 7.867602299465374e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:45:55,479] [INFO] [timer.py:199:stop] epoch=4/micro_step=2240/global_step=4240, RunningAvgSamplesPerSec=23.781502018707602, CurrSamplesPerSec=23.78313636870262, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:46:22,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=4250, skipped=77, lr=[7.859603190333436e-06, 7.859603190333436e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:46:22,428] [INFO] [timer.py:199:stop] epoch=4/micro_step=2280/global_step=4250, RunningAvgSamplesPerSec=23.78151450258674, CurrSamplesPerSec=23.808763891452408, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:46:49,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=4260, skipped=77, lr=[7.851590258724668e-06, 7.851590258724668e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:46:49,381] [INFO] [timer.py:199:stop] epoch=4/micro_step=2320/global_step=4260, RunningAvgSamplesPerSec=23.781523427592713, CurrSamplesPerSec=23.658004726745595, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:47:16,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=4270, skipped=77, 
lr=[7.843563541137601e-06, 7.843563541137601e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:47:16,294] [INFO] [timer.py:199:stop] epoch=4/micro_step=2360/global_step=4270, RunningAvgSamplesPerSec=23.781607988393542, CurrSamplesPerSec=23.797539124006303, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:47:37,534] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:47:39,935] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:47:42,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=4280, skipped=79, lr=[7.83713226573068e-06, 7.83713226573068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:47:42,629] [INFO] [timer.py:199:stop] epoch=4/micro_step=2400/global_step=4280, RunningAvgSamplesPerSec=23.78288590596522, CurrSamplesPerSec=23.80358288826483, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:48:09,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=4290, skipped=79, lr=[7.829080825559315e-06, 7.829080825559315e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:48:09,578] [INFO] [timer.py:199:stop] epoch=4/micro_step=2440/global_step=4290, RunningAvgSamplesPerSec=23.782894857118233, CurrSamplesPerSec=23.817110173655614, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:48:36,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=4300, skipped=79, lr=[7.821015701939074e-06, 7.821015701939074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:48:36,499] [INFO] [timer.py:199:stop] epoch=4/micro_step=2480/global_step=4300, RunningAvgSamplesPerSec=23.782962258306064, CurrSamplesPerSec=23.79390252755549, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:49:03,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=4310, skipped=79, lr=[7.812936931606227e-06, 7.812936931606227e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 18:49:03,408] [INFO] [timer.py:199:stop] epoch=4/micro_step=2520/global_step=4310, RunningAvgSamplesPerSec=23.783053494537537, CurrSamplesPerSec=23.880526312792863, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:49:30,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=4320, skipped=79, lr=[7.804844551359194e-06, 7.804844551359194e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:49:30,313] [INFO] [timer.py:199:stop] epoch=4/micro_step=2560/global_step=4320, RunningAvgSamplesPerSec=23.783157117797483, CurrSamplesPerSec=23.77442854141713, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:49:57,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=4330, skipped=79, lr=[7.796738598058392e-06, 7.796738598058392e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:49:57,297] [INFO] [timer.py:199:stop] epoch=4/micro_step=2600/global_step=4330, RunningAvgSamplesPerSec=23.78310112000742, CurrSamplesPerSec=23.727160961424147, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:50:24,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=4340, skipped=79, lr=[7.788619108626064e-06, 7.788619108626064e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:50:24,299] [INFO] [timer.py:199:stop] epoch=4/micro_step=2640/global_step=4340, RunningAvgSamplesPerSec=23.783010368918863, CurrSamplesPerSec=23.78563151282686, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:50:50,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=4350, skipped=79, lr=[7.78048612004611e-06, 7.78048612004611e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:50:51,220] [INFO] [timer.py:199:stop] epoch=4/micro_step=2680/global_step=4350, RunningAvgSamplesPerSec=23.78307748068538, CurrSamplesPerSec=23.836372661409268, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:51:17,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=4360, skipped=79, lr=[7.77233966936391e-06, 7.77233966936391e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:51:18,133] 
[INFO] [timer.py:199:stop] epoch=4/micro_step=2720/global_step=4360, RunningAvgSamplesPerSec=23.783158284584058, CurrSamplesPerSec=23.72559231682245, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:51:44,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=4370, skipped=79, lr=[7.764179793686174e-06, 7.764179793686174e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:51:45,052] [INFO] [timer.py:199:stop] epoch=4/micro_step=2760/global_step=4370, RunningAvgSamplesPerSec=23.783226164241267, CurrSamplesPerSec=23.806665038969395, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:52:11,707] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:52:11,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=4380, skipped=80, lr=[7.756824457923e-06, 7.756824457923e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:52:11,709] [INFO] [timer.py:199:stop] epoch=4/micro_step=2800/global_step=4380, RunningAvgSamplesPerSec=23.783830713526097, CurrSamplesPerSec=26.626275194413584, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:52:14,116] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:52:38,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=4390, skipped=81, lr=[7.749458305156224e-06, 7.749458305156224e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:52:38,425] [INFO] [timer.py:199:stop] epoch=4/micro_step=2840/global_step=4390, RunningAvgSamplesPerSec=23.784327293030866, CurrSamplesPerSec=23.825413672241467, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:53:05,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=4400, skipped=81, lr=[7.741261037418517e-06, 7.741261037418517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:53:05,333] [INFO] [timer.py:199:stop] epoch=4/micro_step=2880/global_step=4400, RunningAvgSamplesPerSec=23.784418450772673, CurrSamplesPerSec=23.78456089700253, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:53:32,060] [INFO] [logging.py:96:log_dist] [Rank 0] step=4410, skipped=81, lr=[7.733050486246999e-06, 7.733050486246999e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:53:32,303] [INFO] [timer.py:199:stop] epoch=4/micro_step=2920/global_step=4410, RunningAvgSamplesPerSec=23.78438944737176, CurrSamplesPerSec=23.7949845308708, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:53:58,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=4420, skipped=81, lr=[7.72482668904035e-06, 7.72482668904035e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:53:59,223] [INFO] [timer.py:199:stop] epoch=4/micro_step=2960/global_step=4420, RunningAvgSamplesPerSec=23.784462521408805, CurrSamplesPerSec=23.87861233060678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:54:25,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=4430, skipped=81, lr=[7.716589683257589e-06, 7.716589683257589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:54:26,176] [INFO] [timer.py:199:stop] epoch=4/micro_step=3000/global_step=4430, RunningAvgSamplesPerSec=23.784465033729912, CurrSamplesPerSec=23.79411765488, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 18:54:52,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=4440, skipped=81, lr=[7.708339506417893e-06, 7.708339506417893e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:54:53,112] [INFO] [timer.py:199:stop] epoch=4/micro_step=3040/global_step=4440, RunningAvgSamplesPerSec=23.784498307179014, CurrSamplesPerSec=23.800258860125272, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:55:19,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=4450, skipped=81, lr=[7.700076196100432e-06, 7.700076196100432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:55:20,055] [INFO] [timer.py:199:stop] epoch=4/micro_step=3080/global_step=4450, RunningAvgSamplesPerSec=23.78452359180732, CurrSamplesPerSec=23.8052230841854, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:55:48,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=4460, skipped=81, lr=[7.691799789944201e-06, 7.691799789944201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:55:48,657] [INFO] [timer.py:199:stop] epoch=4/micro_step=3120/global_step=4460, RunningAvgSamplesPerSec=23.78143594026278, CurrSamplesPerSec=23.83857837114144, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:56:15,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=4470, skipped=81, lr=[7.683510325647853e-06, 7.683510325647853e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:56:15,590] [INFO] [timer.py:199:stop] epoch=4/micro_step=3160/global_step=4470, RunningAvgSamplesPerSec=23.781487306390876, CurrSamplesPerSec=23.797138284511465, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:56:42,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=4480, skipped=81, lr=[7.675207840969509e-06, 7.675207840969509e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:56:42,544] [INFO] [timer.py:199:stop] epoch=4/micro_step=3200/global_step=4480, RunningAvgSamplesPerSec=23.781500611272204, CurrSamplesPerSec=23.712455725002233, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 18:56:47,652] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 18:56:50,063] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 18:57:08,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=4490, skipped=83, lr=[7.668556503963122e-06, 7.668556503963122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:57:09,004] [INFO] [timer.py:199:stop] epoch=4/micro_step=3240/global_step=4490, RunningAvgSamplesPerSec=23.782491761083268, CurrSamplesPerSec=23.746074409518265, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:57:35,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=4500, skipped=83, lr=[7.660230677936862e-06, 7.660230677936862e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:57:36,156] [INFO] [timer.py:199:stop] epoch=4/micro_step=3280/global_step=4500, RunningAvgSamplesPerSec=23.78214913112578, CurrSamplesPerSec=23.61517625380111, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:58:02,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=4510, skipped=83, lr=[7.651891937566321e-06, 7.651891937566321e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:58:03,106] [INFO] [timer.py:199:stop] epoch=4/micro_step=3320/global_step=4510, RunningAvgSamplesPerSec=23.78217107802006, CurrSamplesPerSec=23.80385940481186, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:58:29,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=4520, skipped=83, lr=[7.643540320834075e-06, 7.643540320834075e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:58:30,040] [INFO] [timer.py:199:stop] epoch=4/micro_step=3360/global_step=4520, RunningAvgSamplesPerSec=23.782227629878136, CurrSamplesPerSec=23.798885199642637, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:58:56,764] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=4530, skipped=83, lr=[7.635175865781353e-06, 7.635175865781353e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:58:57,007] [INFO] [timer.py:199:stop] epoch=4/micro_step=3400/global_step=4530, RunningAvgSamplesPerSec=23.782213363772275, CurrSamplesPerSec=23.748076451982275, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:59:23,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=4540, skipped=83, lr=[7.62679861050786e-06, 7.62679861050786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:59:24,039] [INFO] [timer.py:199:stop] epoch=4/micro_step=3440/global_step=4540, RunningAvgSamplesPerSec=23.78207482046127, CurrSamplesPerSec=23.53256748041133, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 18:59:50,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=4550, skipped=83, lr=[7.6184085931716046e-06, 7.6184085931716046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 18:59:51,023] [INFO] [timer.py:199:stop] epoch=4/micro_step=3480/global_step=4550, RunningAvgSamplesPerSec=23.782031180948355, CurrSamplesPerSec=23.748099562522945, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:00:17,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=4560, skipped=83, lr=[7.610005851988726e-06, 7.610005851988726e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:00:18,035] [INFO] [timer.py:199:stop] epoch=4/micro_step=3520/global_step=4560, RunningAvgSamplesPerSec=23.781927751417697, CurrSamplesPerSec=23.7731294439762, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:00:44,741] [INFO] [logging.py:96:log_dist] [Rank 0] step=4570, skipped=83, lr=[7.601590425233322e-06, 7.601590425233322e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:00:44,982] [INFO] [timer.py:199:stop] epoch=4/micro_step=3560/global_step=4570, RunningAvgSamplesPerSec=23.78194896163577, CurrSamplesPerSec=23.771430514588808, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:01:11,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=4580, skipped=83, 
lr=[7.593162351237268e-06, 7.593162351237268e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:01:11,907] [INFO] [timer.py:199:stop] epoch=4/micro_step=3600/global_step=4580, RunningAvgSamplesPerSec=23.78201659346857, CurrSamplesPerSec=23.727039321135674, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:01:22,382] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:01:24,787] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:01:38,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=4590, skipped=85, lr=[7.586410811822965e-06, 7.586410811822965e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:01:38,244] [INFO] [timer.py:199:stop] epoch=4/micro_step=3640/global_step=4590, RunningAvgSamplesPerSec=23.78321767989, CurrSamplesPerSec=23.793181246677793, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:02:04,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=4600, skipped=85, lr=[7.577960069573853e-06, 7.577960069573853e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:02:05,193] [INFO] [timer.py:199:stop] epoch=4/micro_step=3680/global_step=4600, RunningAvgSamplesPerSec=23.783238529483267, CurrSamplesPerSec=23.7321702687384, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 5/16 *****
ppl: 1.847437858581543
saving the final model ...
Beginning of Epoch 6/16, Total Micro Batches 3680
[2023-04-23 19:03:23,463] [INFO] [logging.py:96:log_dist] [Rank 0] step=4610, skipped=85, lr=[7.569496787719269e-06, 7.569496787719269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:03:23,706] [INFO] [timer.py:199:stop] epoch=5/micro_step=40/global_step=4610, RunningAvgSamplesPerSec=23.78319092142417, CurrSamplesPerSec=23.750282660929603, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:03:50,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=4620, skipped=85, lr=[7.561021004809068e-06, 7.561021004809068e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:03:50,609] [INFO] [timer.py:199:stop] epoch=5/micro_step=80/global_step=4620, RunningAvgSamplesPerSec=23.783292031569534, CurrSamplesPerSec=23.77558668962662, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:04:17,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=4630, skipped=85, lr=[7.552532759450048e-06, 7.552532759450048e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:04:17,651] [INFO] [timer.py:199:stop] epoch=5/micro_step=120/global_step=4630, RunningAvgSamplesPerSec=23.783139697913207, CurrSamplesPerSec=23.638612994729392, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:04:44,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=4640, skipped=85, lr=[7.544032090305774e-06, 7.544032090305774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:04:44,660] [INFO] [timer.py:199:stop] epoch=5/micro_step=160/global_step=4640, RunningAvgSamplesPerSec=23.783046629653192, CurrSamplesPerSec=23.75399632091764, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:05:11,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=4650, skipped=85, lr=[7.5355190360964e-06, 7.5355190360964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:05:11,615] [INFO] [timer.py:199:stop] epoch=5/micro_step=200/global_step=4650, RunningAvgSamplesPerSec=23.78305262826225, CurrSamplesPerSec=23.74896098748655, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 19:05:38,487] [INFO] [logging.py:96:log_dist] [Rank 0] step=4660, skipped=85, lr=[7.5269936355984914e-06, 7.5269936355984914e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:05:38,729] [INFO] [timer.py:199:stop] epoch=5/micro_step=240/global_step=4660, RunningAvgSamplesPerSec=23.78281013503576, CurrSamplesPerSec=23.806120326881196, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:06:05,443] [INFO] [logging.py:96:log_dist] [Rank 0] step=4670, skipped=85, lr=[7.518455927644855e-06, 7.518455927644855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:06:05,683] [INFO] [timer.py:199:stop] epoch=5/micro_step=280/global_step=4670, RunningAvgSamplesPerSec=23.78281778942852, CurrSamplesPerSec=23.60911356761609, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:06:32,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=4680, skipped=85, lr=[7.509905951124352e-06, 7.509905951124352e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:06:32,687] [INFO] [timer.py:199:stop] epoch=5/micro_step=320/global_step=4680, RunningAvgSamplesPerSec=23.782736157935254, CurrSamplesPerSec=23.82812497780833, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:06:48,545] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:06:50,951] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:06:58,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=4690, skipped=87, lr=[7.5030571627088036e-06, 7.5030571627088036e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:06:59,052] [INFO] [timer.py:199:stop] epoch=5/micro_step=360/global_step=4690, RunningAvgSamplesPerSec=23.783857713150113, CurrSamplesPerSec=23.707632708113863, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:07:25,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=4700, skipped=87, lr=[7.494485200946144e-06, 7.494485200946144e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:07:26,151] [INFO] [timer.py:199:stop] epoch=5/micro_step=400/global_step=4700, RunningAvgSamplesPerSec=23.78368295040644, CurrSamplesPerSec=22.971568309310506, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:07:52,860] [INFO] [logging.py:96:log_dist] [Rank 0] step=4710, skipped=87, lr=[7.485901079802167e-06, 7.485901079802167e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:07:53,103] [INFO] [timer.py:199:stop] epoch=5/micro_step=440/global_step=4710, RunningAvgSamplesPerSec=23.783687869948327, CurrSamplesPerSec=23.805501749969892, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:08:19,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=4720, skipped=87, lr=[7.477304838377149e-06, 7.477304838377149e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:08:20,015] [INFO] [timer.py:199:stop] epoch=5/micro_step=480/global_step=4720, RunningAvgSamplesPerSec=23.78377004665426, CurrSamplesPerSec=23.832381406947555, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:08:46,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=4730, skipped=87, lr=[7.468696515826568e-06, 7.468696515826568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:08:46,946] [INFO] [timer.py:199:stop] epoch=5/micro_step=520/global_step=4730, RunningAvgSamplesPerSec=23.78381807724459, CurrSamplesPerSec=23.88600229342961, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 19:09:13,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=4740, skipped=87, lr=[7.4600761513609355e-06, 7.4600761513609355e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:09:13,855] [INFO] [timer.py:199:stop] epoch=5/micro_step=560/global_step=4740, RunningAvgSamplesPerSec=23.78390222212436, CurrSamplesPerSec=23.810930700804878, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:09:40,530] [INFO] [logging.py:96:log_dist] [Rank 0] step=4750, skipped=87, lr=[7.451443784245611e-06, 7.451443784245611e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:09:40,772] [INFO] [timer.py:199:stop] epoch=5/micro_step=600/global_step=4750, RunningAvgSamplesPerSec=23.783977415240297, CurrSamplesPerSec=23.814249260028255, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:10:07,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=4760, skipped=87, lr=[7.442799453800628e-06, 7.442799453800628e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:10:07,695] [INFO] [timer.py:199:stop] epoch=5/micro_step=640/global_step=4760, RunningAvgSamplesPerSec=23.784036048905325, CurrSamplesPerSec=23.88229399170956, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:10:34,369] [INFO] [logging.py:96:log_dist] [Rank 0] step=4770, skipped=87, lr=[7.434143199400509e-06, 7.434143199400509e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:10:34,613] [INFO] [timer.py:199:stop] epoch=5/micro_step=680/global_step=4770, RunningAvgSamplesPerSec=23.78410791713792, CurrSamplesPerSec=23.741353195133986, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:11:01,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=4780, skipped=87, lr=[7.425475060474094e-06, 7.425475060474094e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:11:01,524] [INFO] [timer.py:199:stop] epoch=5/micro_step=720/global_step=4780, RunningAvgSamplesPerSec=23.784185249089333, CurrSamplesPerSec=23.794948673433975, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 19:11:22,839] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:11:25,255] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:11:27,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=4790, skipped=89, lr=[7.4185320190047675e-06, 7.4185320190047675e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:11:27,960] [INFO] [timer.py:199:stop] epoch=5/micro_step=760/global_step=4790, RunningAvgSamplesPerSec=23.785142636999534, CurrSamplesPerSec=23.715019864006493, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:11:54,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=4800, skipped=89, lr=[7.409842587464366e-06, 7.409842587464366e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:11:54,975] [INFO] [timer.py:199:stop] epoch=5/micro_step=800/global_step=4800, RunningAvgSamplesPerSec=23.785031747906288, CurrSamplesPerSec=23.68039848387837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:12:21,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=4810, skipped=89, lr=[7.401141382085827e-06, 7.401141382085827e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:12:21,911] [INFO] [timer.py:199:stop] epoch=5/micro_step=840/global_step=4810, RunningAvgSamplesPerSec=23.785064528513786, CurrSamplesPerSec=23.72888712959229, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:12:48,613] [INFO] [logging.py:96:log_dist] [Rank 0] step=4820, skipped=89, lr=[7.392428442502737e-06, 7.392428442502737e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:12:48,856] [INFO] [timer.py:199:stop] epoch=5/micro_step=880/global_step=4820, RunningAvgSamplesPerSec=23.78507824403918, CurrSamplesPerSec=23.766926474629766, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:13:15,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=4830, 
skipped=89, lr=[7.383703808402133e-06, 7.383703808402133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:13:15,785] [INFO] [timer.py:199:stop] epoch=5/micro_step=920/global_step=4830, RunningAvgSamplesPerSec=23.785118919789518, CurrSamplesPerSec=23.775879403127107, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:13:42,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=4840, skipped=89, lr=[7.3749675195243195e-06, 7.3749675195243195e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:13:42,736] [INFO] [timer.py:199:stop] epoch=5/micro_step=960/global_step=4840, RunningAvgSamplesPerSec=23.785127766811073, CurrSamplesPerSec=23.80634834286156, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:14:09,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=4850, skipped=89, lr=[7.366219615662688e-06, 7.366219615662688e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:14:09,719] [INFO] [timer.py:199:stop] epoch=5/micro_step=1000/global_step=4850, RunningAvgSamplesPerSec=23.78507707269332, CurrSamplesPerSec=23.816516382472187, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:14:36,462] [INFO] [logging.py:96:log_dist] [Rank 0] step=4860, skipped=89, lr=[7.3574601366635306e-06, 7.3574601366635306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:14:36,705] [INFO] [timer.py:199:stop] epoch=5/micro_step=1040/global_step=4860, RunningAvgSamplesPerSec=23.785022938631805, CurrSamplesPerSec=23.83076073827249, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:15:03,466] [INFO] [logging.py:96:log_dist] [Rank 0] step=4870, skipped=89, lr=[7.348689122425873e-06, 7.348689122425873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:15:03,707] [INFO] [timer.py:199:stop] epoch=5/micro_step=1080/global_step=4870, RunningAvgSamplesPerSec=23.78493568322757, CurrSamplesPerSec=23.744097907584504, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:15:30,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=4880, skipped=89, 
lr=[7.339906612901275e-06, 7.339906612901275e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:15:30,642] [INFO] [timer.py:199:stop] epoch=5/micro_step=1120/global_step=4880, RunningAvgSamplesPerSec=23.784969278840954, CurrSamplesPerSec=23.801130404366276, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:15:57,280] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:15:57,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=4890, skipped=90, lr=[7.331992558920968e-06, 7.331992558920968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:15:57,282] [INFO] [timer.py:199:stop] epoch=5/micro_step=1160/global_step=4890, RunningAvgSamplesPerSec=23.785535605052498, CurrSamplesPerSec=26.72953610963159, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:15:59,684] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:16:23,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=4900, skipped=91, lr=[7.324069255359752e-06, 7.324069255359752e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:16:23,914] [INFO] [timer.py:199:stop] epoch=5/micro_step=1200/global_step=4900, RunningAvgSamplesPerSec=23.78610706811741, CurrSamplesPerSec=23.892928956496668, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:16:50,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=4910, skipped=91, lr=[7.315254772019035e-06, 7.315254772019035e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:16:50,816] [INFO] [timer.py:199:stop] epoch=5/micro_step=1240/global_step=4910, RunningAvgSamplesPerSec=23.786194250589173, CurrSamplesPerSec=23.767837668339553, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:17:17,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=4920, skipped=91, lr=[7.306428945683295e-06, 7.306428945683295e-06], mom=[(0.9, 0.95), (0.9, 
0.95)] [2023-04-23 19:17:17,748] [INFO] [timer.py:199:stop] epoch=5/micro_step=1280/global_step=4920, RunningAvgSamplesPerSec=23.786231228086848, CurrSamplesPerSec=23.79154692697066, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:17:44,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=4930, skipped=91, lr=[7.297591816553761e-06, 7.297591816553761e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:17:44,681] [INFO] [timer.py:199:stop] epoch=5/micro_step=1320/global_step=4930, RunningAvgSamplesPerSec=23.786263146028958, CurrSamplesPerSec=23.844878110050896, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:18:11,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=4940, skipped=91, lr=[7.288743424883148e-06, 7.288743424883148e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:18:11,596] [INFO] [timer.py:199:stop] epoch=5/micro_step=1360/global_step=4940, RunningAvgSamplesPerSec=23.786327293381344, CurrSamplesPerSec=23.839357452545162, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:18:38,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=4950, skipped=91, lr=[7.2798838109754684e-06, 7.2798838109754684e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:18:38,561] [INFO] [timer.py:199:stop] epoch=5/micro_step=1400/global_step=4950, RunningAvgSamplesPerSec=23.786310108587262, CurrSamplesPerSec=23.722912679309026, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:19:05,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=4960, skipped=91, lr=[7.271013015185852e-06, 7.271013015185852e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:19:05,560] [INFO] [timer.py:199:stop] epoch=5/micro_step=1440/global_step=4960, RunningAvgSamplesPerSec=23.786227502973293, CurrSamplesPerSec=23.684211520070555, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:19:32,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=4970, skipped=91, lr=[7.262131077920364e-06, 7.262131077920364e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 
19:19:32,520] [INFO] [timer.py:199:stop] epoch=5/micro_step=1480/global_step=4970, RunningAvgSamplesPerSec=23.786216168624048, CurrSamplesPerSec=23.82321251217295, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:19:59,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=4980, skipped=91, lr=[7.253238039635817e-06, 7.253238039635817e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:19:59,451] [INFO] [timer.py:199:stop] epoch=5/micro_step=1520/global_step=4980, RunningAvgSamplesPerSec=23.786253147195247, CurrSamplesPerSec=23.815278180093003, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:20:26,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=4990, skipped=91, lr=[7.244333940839585e-06, 7.244333940839585e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:20:26,376] [INFO] [timer.py:199:stop] epoch=5/micro_step=1560/global_step=4990, RunningAvgSamplesPerSec=23.78629784776984, CurrSamplesPerSec=23.717724961269404, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:20:31,473] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:20:33,877] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:20:52,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=5000, skipped=93, lr=[7.237202725487315e-06, 7.237202725487315e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:20:52,733] [INFO] [timer.py:199:stop] epoch=5/micro_step=1600/global_step=5000, RunningAvgSamplesPerSec=23.78734546183774, CurrSamplesPerSec=23.77129157943265, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:21:19,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=5010, skipped=93, lr=[7.228278820009342e-06, 7.228278820009342e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:21:19,651] [INFO] [timer.py:199:stop] epoch=5/micro_step=1640/global_step=5010, RunningAvgSamplesPerSec=23.78740554011194, CurrSamplesPerSec=23.844865401344286, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:21:46,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=5020, skipped=93, lr=[7.219343967707773e-06, 7.219343967707773e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:21:46,597] [INFO] [timer.py:199:stop] epoch=5/micro_step=1680/global_step=5020, RunningAvgSamplesPerSec=23.787420806948344, CurrSamplesPerSec=23.776709148191443, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:22:13,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=5030, skipped=93, lr=[7.2103982092804494e-06, 7.2103982092804494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:22:13,502] [INFO] [timer.py:199:stop] epoch=5/micro_step=1720/global_step=5030, RunningAvgSamplesPerSec=23.787501367351297, CurrSamplesPerSec=23.81033299372906, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:22:40,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=5040, skipped=93, lr=[7.201441585474881e-06, 7.201441585474881e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:22:40,414] [INFO] [timer.py:199:stop] epoch=5/micro_step=1760/global_step=5040, RunningAvgSamplesPerSec=23.787572605969423, CurrSamplesPerSec=23.860269818318802, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 19:23:07,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=5050, skipped=93, lr=[7.1924741370880786e-06, 7.1924741370880786e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:23:07,347] [INFO] [timer.py:199:stop] epoch=5/micro_step=1800/global_step=5050, RunningAvgSamplesPerSec=23.78760651520922, CurrSamplesPerSec=23.826712145651097, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:23:34,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=5060, skipped=93, lr=[7.183495904966351e-06, 7.183495904966351e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:23:34,297] [INFO] [timer.py:199:stop] epoch=5/micro_step=1840/global_step=5060, RunningAvgSamplesPerSec=23.78760600632678, CurrSamplesPerSec=23.78274654972015, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:24:00,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=5070, skipped=93, lr=[7.174506930005133e-06, 7.174506930005133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:24:01,233] [INFO] [timer.py:199:stop] epoch=5/micro_step=1880/global_step=5070, RunningAvgSamplesPerSec=23.787631904669222, CurrSamplesPerSec=23.862412073425716, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:24:27,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=5080, skipped=93, lr=[7.165507253148784e-06, 7.165507253148784e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:24:28,145] [INFO] [timer.py:199:stop] epoch=5/micro_step=1920/global_step=5080, RunningAvgSamplesPerSec=23.787698533729348, CurrSamplesPerSec=23.85742185104519, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:24:54,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=5090, skipped=93, lr=[7.156496915390418e-06, 7.156496915390418e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:24:55,078] [INFO] [timer.py:199:stop] epoch=5/micro_step=1960/global_step=5090, RunningAvgSamplesPerSec=23.787728646297186, CurrSamplesPerSec=23.759591070370703, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 19:25:05,573] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:25:07,983] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:25:21,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=5100, skipped=95, lr=[7.149280996912648e-06, 7.149280996912648e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:25:21,473] [INFO] [timer.py:199:stop] epoch=5/micro_step=2000/global_step=5100, RunningAvgSamplesPerSec=23.788692566591195, CurrSamplesPerSec=23.778297197428138, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:25:48,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=5110, skipped=95, lr=[7.1402515729881725e-06, 7.1402515729881725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:25:48,422] [INFO] [timer.py:199:stop] epoch=5/micro_step=2040/global_step=5110, RunningAvgSamplesPerSec=23.788690559569556, CurrSamplesPerSec=23.766600313723725, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:26:15,174] [INFO] [logging.py:96:log_dist] [Rank 0] step=5120, skipped=95, lr=[7.1312116032001304e-06, 7.1312116032001304e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:26:15,418] [INFO] [timer.py:199:stop] epoch=5/micro_step=2080/global_step=5120, RunningAvgSamplesPerSec=23.788610846264753, CurrSamplesPerSec=23.73601259009695, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:26:42,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=5130, skipped=95, lr=[7.1221611287251676e-06, 7.1221611287251676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:26:42,341] [INFO] [timer.py:199:stop] epoch=5/micro_step=2120/global_step=5130, RunningAvgSamplesPerSec=23.788652834993893, CurrSamplesPerSec=23.81669388278393, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:27:09,012] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=5140, skipped=95, lr=[7.113100190787772e-06, 7.113100190787772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:27:09,254] [INFO] [timer.py:199:stop] epoch=5/micro_step=2160/global_step=5140, RunningAvgSamplesPerSec=23.788712659717962, CurrSamplesPerSec=23.865091494178806, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:27:35,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=5150, skipped=95, lr=[7.104028830660098e-06, 7.104028830660098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:27:36,197] [INFO] [timer.py:199:stop] epoch=5/micro_step=2200/global_step=5150, RunningAvgSamplesPerSec=23.78872094274376, CurrSamplesPerSec=23.829434323648147, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:28:02,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=5160, skipped=95, lr=[7.0949470896617695e-06, 7.0949470896617695e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:28:03,123] [INFO] [timer.py:199:stop] epoch=5/micro_step=2240/global_step=5160, RunningAvgSamplesPerSec=23.788762183187604, CurrSamplesPerSec=23.74411891012614, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:28:29,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=5170, skipped=95, lr=[7.085855009159696e-06, 7.085855009159696e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:28:30,100] [INFO] [timer.py:199:stop] epoch=5/micro_step=2280/global_step=5170, RunningAvgSamplesPerSec=23.788716729640267, CurrSamplesPerSec=23.821692447905804, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:28:56,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=5180, skipped=95, lr=[7.076752630567883e-06, 7.076752630567883e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:28:57,039] [INFO] [timer.py:199:stop] epoch=5/micro_step=2320/global_step=5180, RunningAvgSamplesPerSec=23.788735846106114, CurrSamplesPerSec=23.805372971788948, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:29:23,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=5190, skipped=95, 
lr=[7.067639995347243e-06, 7.067639995347243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:29:24,033] [INFO] [timer.py:199:stop] epoch=5/micro_step=2360/global_step=5190, RunningAvgSamplesPerSec=23.788660293532924, CurrSamplesPerSec=23.740761075671728, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:29:39,923] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:29:42,322] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:29:50,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=5200, skipped=97, lr=[7.060342530289537e-06, 7.060342530289537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:29:50,392] [INFO] [timer.py:199:stop] epoch=5/micro_step=2400/global_step=5200, RunningAvgSamplesPerSec=23.789663596253973, CurrSamplesPerSec=23.869155263551715, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:30:17,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=5210, skipped=97, lr=[7.051211537767513e-06, 7.051211537767513e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:30:17,310] [INFO] [timer.py:199:stop] epoch=5/micro_step=2440/global_step=5210, RunningAvgSamplesPerSec=23.789714135396853, CurrSamplesPerSec=23.811583355169564, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:30:43,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=5220, skipped=97, lr=[7.042070404955154e-06, 7.042070404955154e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:30:44,233] [INFO] [timer.py:199:stop] epoch=5/micro_step=2480/global_step=5220, RunningAvgSamplesPerSec=23.78975685457971, CurrSamplesPerSec=23.856317200896715, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:31:10,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=5230, skipped=97, lr=[7.032919173489897e-06, 7.032919173489897e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 19:31:11,169] [INFO] [timer.py:199:stop] epoch=5/micro_step=2520/global_step=5230, RunningAvgSamplesPerSec=23.78978439277621, CurrSamplesPerSec=23.722608689811832, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:31:37,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=5240, skipped=97, lr=[7.0237578850551766e-06, 7.0237578850551766e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:31:38,119] [INFO] [timer.py:199:stop] epoch=5/micro_step=2560/global_step=5240, RunningAvgSamplesPerSec=23.789783009027044, CurrSamplesPerSec=23.798235351013513, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:32:04,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=5250, skipped=97, lr=[7.014586581380237e-06, 7.014586581380237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:32:05,070] [INFO] [timer.py:199:stop] epoch=5/micro_step=2600/global_step=5250, RunningAvgSamplesPerSec=23.789781669520448, CurrSamplesPerSec=23.779181879184225, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:32:31,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=5260, skipped=97, lr=[7.005405304239943e-06, 7.005405304239943e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:32:32,016] [INFO] [timer.py:199:stop] epoch=5/micro_step=2640/global_step=5260, RunningAvgSamplesPerSec=23.789796401501228, CurrSamplesPerSec=23.841396431030685, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:32:58,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=5270, skipped=97, lr=[6.996214095454584e-06, 6.996214095454584e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:32:58,972] [INFO] [timer.py:199:stop] epoch=5/micro_step=2680/global_step=5270, RunningAvgSamplesPerSec=23.789784648820376, CurrSamplesPerSec=23.65046550637866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:33:25,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=5280, skipped=97, lr=[6.9870129968896915e-06, 6.9870129968896915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 19:33:25,985] [INFO] [timer.py:199:stop] epoch=5/micro_step=2720/global_step=5280, RunningAvgSamplesPerSec=23.789674675974734, CurrSamplesPerSec=23.685488378084816, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:33:52,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=5290, skipped=97, lr=[6.977802050455843e-06, 6.977802050455843e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:33:52,931] [INFO] [timer.py:199:stop] epoch=5/micro_step=2760/global_step=5290, RunningAvgSamplesPerSec=23.78968684571954, CurrSamplesPerSec=23.796766992605626, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:34:14,185] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:34:16,589] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:34:19,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=5300, skipped=99, lr=[6.970426231035663e-06, 6.970426231035663e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:34:19,282] [INFO] [timer.py:199:stop] epoch=5/micro_step=2800/global_step=5300, RunningAvgSamplesPerSec=23.790679709631043, CurrSamplesPerSec=23.808552722548413, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:34:45,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=5310, skipped=99, lr=[6.9611976641954075e-06, 6.9611976641954075e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:34:46,195] [INFO] [timer.py:199:stop] epoch=5/micro_step=2840/global_step=5310, RunningAvgSamplesPerSec=23.790734883847723, CurrSamplesPerSec=23.831719148858348, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:35:12,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=5320, skipped=99, lr=[6.951959367073837e-06, 6.951959367073837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:35:13,134] [INFO] [timer.py:199:stop] 
epoch=5/micro_step=2880/global_step=5320, RunningAvgSamplesPerSec=23.790747644005545, CurrSamplesPerSec=23.80616255173321, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:35:39,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=5330, skipped=99, lr=[6.942711381750969e-06, 6.942711381750969e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:35:40,063] [INFO] [timer.py:199:stop] epoch=5/micro_step=2920/global_step=5330, RunningAvgSamplesPerSec=23.790780928963372, CurrSamplesPerSec=23.81548735541588, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:36:06,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=5340, skipped=99, lr=[6.933453750350946e-06, 6.933453750350946e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:36:07,004] [INFO] [timer.py:199:stop] epoch=5/micro_step=2960/global_step=5340, RunningAvgSamplesPerSec=23.790793122574954, CurrSamplesPerSec=23.82326748336075, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:36:33,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=5350, skipped=99, lr=[6.924186515041852e-06, 6.924186515041852e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:36:33,974] [INFO] [timer.py:199:stop] epoch=5/micro_step=3000/global_step=5350, RunningAvgSamplesPerSec=23.79075672340701, CurrSamplesPerSec=23.806719933818258, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:37:00,657] [INFO] [logging.py:96:log_dist] [Rank 0] step=5360, skipped=99, lr=[6.914909718035512e-06, 6.914909718035512e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:37:00,898] [INFO] [timer.py:199:stop] epoch=5/micro_step=3040/global_step=5360, RunningAvgSamplesPerSec=23.790793077653824, CurrSamplesPerSec=23.88547094743835, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:37:27,584] [INFO] [logging.py:96:log_dist] [Rank 0] step=5370, skipped=99, lr=[6.905623401587307e-06, 6.905623401587307e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:37:27,825] [INFO] [timer.py:199:stop] 
epoch=5/micro_step=3080/global_step=5370, RunningAvgSamplesPerSec=23.790823683120415, CurrSamplesPerSec=23.79473142186298, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:37:54,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=5380, skipped=99, lr=[6.89632760799598e-06, 6.89632760799598e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:37:54,743] [INFO] [timer.py:199:stop] epoch=5/micro_step=3120/global_step=5380, RunningAvgSamplesPerSec=23.79087301913237, CurrSamplesPerSec=23.840049776848772, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:38:21,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=5390, skipped=99, lr=[6.887022379603437e-06, 6.887022379603437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:38:21,667] [INFO] [timer.py:199:stop] epoch=5/micro_step=3160/global_step=5390, RunningAvgSamplesPerSec=23.790910470349147, CurrSamplesPerSec=23.824121683645917, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:38:48,308] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:38:48,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=5400, skipped=100, lr=[6.878639642325329e-06, 6.878639642325329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:38:48,309] [INFO] [timer.py:199:stop] epoch=5/micro_step=3200/global_step=5400, RunningAvgSamplesPerSec=23.791412604701268, CurrSamplesPerSec=26.56719628100462, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:38:50,714] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:39:14,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=5410, skipped=101, lr=[6.870249328117662e-06, 6.870249328117662e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:39:14,986] [INFO] [timer.py:199:stop] epoch=5/micro_step=3240/global_step=5410, RunningAvgSamplesPerSec=23.791857318529885, CurrSamplesPerSec=23.772798902132237, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:39:42,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=5420, skipped=101, lr=[6.860917907905735e-06, 6.860917907905735e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:39:42,252] [INFO] [timer.py:199:stop] epoch=5/micro_step=3280/global_step=5420, RunningAvgSamplesPerSec=23.79145698450564, CurrSamplesPerSec=23.81719470149835, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:40:09,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=5430, skipped=101, lr=[6.851577214182135e-06, 6.851577214182135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:40:09,252] [INFO] [timer.py:199:stop] epoch=5/micro_step=3320/global_step=5430, RunningAvgSamplesPerSec=23.79137455423845, CurrSamplesPerSec=23.77226415963216, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:40:36,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=5440, skipped=101, lr=[6.8422272894932905e-06, 6.8422272894932905e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:40:36,270] [INFO] [timer.py:199:stop] epoch=5/micro_step=3360/global_step=5440, RunningAvgSamplesPerSec=23.791262199941954, CurrSamplesPerSec=23.76996335711875, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:41:02,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=5450, skipped=101, lr=[6.832868176427671e-06, 6.832868176427671e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:41:03,239] [INFO] [timer.py:199:stop] epoch=5/micro_step=3400/global_step=5450, RunningAvgSamplesPerSec=23.791227953476206, CurrSamplesPerSec=23.77261573959069, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:41:29,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=5460, skipped=101, lr=[6.823499917615605e-06, 6.823499917615605e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:41:30,181] [INFO] [timer.py:199:stop] epoch=5/micro_step=3440/global_step=5460, RunningAvgSamplesPerSec=23.791233015081726, CurrSamplesPerSec=23.846623567781425, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:41:56,883] [INFO] [logging.py:96:log_dist] [Rank 0] step=5470, skipped=101, lr=[6.814122555729078e-06, 6.814122555729078e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:41:57,123] [INFO] [timer.py:199:stop] epoch=5/micro_step=3480/global_step=5470, RunningAvgSamplesPerSec=23.79124027976433, CurrSamplesPerSec=23.777576862336005, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:42:23,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=5480, skipped=101, lr=[6.8047361334815365e-06, 6.8047361334815365e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:42:24,143] [INFO] [timer.py:199:stop] epoch=5/micro_step=3520/global_step=5480, RunningAvgSamplesPerSec=23.791156587425736, CurrSamplesPerSec=23.644738745172802, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:42:51,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=5490, skipped=101, lr=[6.795340693627699e-06, 6.795340693627699e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:42:51,545] [INFO] [timer.py:199:stop] epoch=5/micro_step=3560/global_step=5490, RunningAvgSamplesPerSec=23.790648271694348, CurrSamplesPerSec=23.72435306648906, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:43:18,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=5500, skipped=101, lr=[6.78593627896336e-06, 6.78593627896336e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:43:18,539] [INFO] [timer.py:199:stop] epoch=5/micro_step=3600/global_step=5500, RunningAvgSamplesPerSec=23.790573681851917, CurrSamplesPerSec=23.690931719644443, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB
[2023-04-23 19:43:23,655] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-23 19:43:26,065] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072
[2023-04-23 19:43:44,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=5510, skipped=103, lr=[6.778406314153203e-06, 6.778406314153203e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:43:44,949] [INFO] [timer.py:199:stop] epoch=5/micro_step=3640/global_step=5510, RunningAvgSamplesPerSec=23.79144316537623, CurrSamplesPerSec=23.792788989470953, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 19:44:11,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=5520, skipped=103, lr=[6.768985852805721e-06, 6.768985852805721e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:44:11,888] [INFO] [timer.py:199:stop] epoch=5/micro_step=3680/global_step=5520, RunningAvgSamplesPerSec=23.791458637410944, CurrSamplesPerSec=23.822489453240472, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 6/16 *****
ppl: 1.8258322477340698
saving the final model ...
Beginning of Epoch 7/16, Total Micro Batches 3680
[2023-04-23 19:45:30,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=5530, skipped=103, lr=[6.759556536692814e-06, 6.759556536692814e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:45:30,367] [INFO] [timer.py:199:stop] epoch=6/micro_step=40/global_step=5530, RunningAvgSamplesPerSec=23.791426641564506, CurrSamplesPerSec=23.779662162563195, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 19:45:57,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=5540, skipped=103, lr=[6.750118408764582e-06, 6.750118408764582e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:45:57,298] [INFO] [timer.py:199:stop] epoch=6/micro_step=80/global_step=5540, RunningAvgSamplesPerSec=23.79144915668875, CurrSamplesPerSec=23.799151057038337, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 19:46:24,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=5550, skipped=103, lr=[6.740671512011257e-06, 6.740671512011257e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:46:24,255] [INFO] [timer.py:199:stop] epoch=6/micro_step=120/global_step=5550, RunningAvgSamplesPerSec=23.79143187433612, CurrSamplesPerSec=23.718347367633275, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 19:46:50,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=5560, skipped=103, lr=[6.731215889463016e-06, 6.731215889463016e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:46:51,185] [INFO] [timer.py:199:stop] epoch=6/micro_step=160/global_step=5560, RunningAvgSamplesPerSec=23.79146124790863, CurrSamplesPerSec=23.865655882782832, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 19:47:17,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=5570, skipped=103, lr=[6.721751584189783e-06, 6.721751584189783e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 19:47:18,136] [INFO] [timer.py:199:stop] epoch=6/micro_step=200/global_step=5570, RunningAvgSamplesPerSec=23.791452906637197, CurrSamplesPerSec=23.84733538117338, MemAllocated=11.44GB,
MaxMemAllocated=31.66GB [2023-04-23 19:47:44,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=5580, skipped=103, lr=[6.712278639301028e-06, 6.712278639301028e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:47:45,166] [INFO] [timer.py:199:stop] epoch=6/micro_step=240/global_step=5580, RunningAvgSamplesPerSec=23.79132127356396, CurrSamplesPerSec=23.66979947344298, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:48:11,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=5590, skipped=103, lr=[6.702797097945577e-06, 6.702797097945577e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:48:12,132] [INFO] [timer.py:199:stop] epoch=6/micro_step=280/global_step=5590, RunningAvgSamplesPerSec=23.791294163910486, CurrSamplesPerSec=23.81815624793981, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:48:38,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=5600, skipped=103, lr=[6.6933070033114125e-06, 6.6933070033114125e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:48:39,098] [INFO] [timer.py:199:stop] epoch=6/micro_step=320/global_step=5600, RunningAvgSamplesPerSec=23.791278280681347, CurrSamplesPerSec=23.707999129879447, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:48:49,595] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:48:51,997] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:49:05,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=5610, skipped=105, lr=[6.685708798290605e-06, 6.685708798290605e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:49:05,478] [INFO] [timer.py:199:stop] epoch=6/micro_step=360/global_step=5610, RunningAvgSamplesPerSec=23.792173767437173, CurrSamplesPerSec=23.78948905792596, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:49:33,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=5620, skipped=105, lr=[6.6762034167127e-06, 6.6762034167127e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:49:33,829] [INFO] [timer.py:199:stop] epoch=6/micro_step=400/global_step=5620, RunningAvgSamplesPerSec=23.79029077704224, CurrSamplesPerSec=23.733278139825536, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:50:00,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=5630, skipped=105, lr=[6.666689602989063e-06, 6.666689602989063e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:50:00,778] [INFO] [timer.py:199:stop] epoch=6/micro_step=440/global_step=5630, RunningAvgSamplesPerSec=23.790290325021715, CurrSamplesPerSec=23.8249421118838, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:50:27,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=5640, skipped=105, lr=[6.657167400454678e-06, 6.657167400454678e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:50:27,743] [INFO] [timer.py:199:stop] epoch=6/micro_step=480/global_step=5640, RunningAvgSamplesPerSec=23.790270231200342, CurrSamplesPerSec=23.80479032071122, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:50:54,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=5650, skipped=105, lr=[6.6476368524827325e-06, 6.6476368524827325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:50:54,684] [INFO] [timer.py:199:stop] epoch=6/micro_step=520/global_step=5650, RunningAvgSamplesPerSec=23.790282934740947, CurrSamplesPerSec=23.790983929469583, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 19:51:21,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=5660, skipped=105, lr=[6.6380980024844336e-06, 6.6380980024844336e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:51:21,625] [INFO] [timer.py:199:stop] epoch=6/micro_step=560/global_step=5660, RunningAvgSamplesPerSec=23.790293676497754, CurrSamplesPerSec=23.830041452048555, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:51:48,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=5670, skipped=105, lr=[6.6285508939087994e-06, 6.6285508939087994e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:51:48,539] [INFO] [timer.py:199:stop] epoch=6/micro_step=600/global_step=5670, RunningAvgSamplesPerSec=23.79034734518588, CurrSamplesPerSec=23.82399481834429, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:52:15,258] [INFO] [logging.py:96:log_dist] [Rank 0] step=5680, skipped=105, lr=[6.618995570242466e-06, 6.618995570242466e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:52:15,502] [INFO] [timer.py:199:stop] epoch=6/micro_step=640/global_step=5680, RunningAvgSamplesPerSec=23.790323224801416, CurrSamplesPerSec=23.767803997120968, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:52:42,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=5690, skipped=105, lr=[6.60943207500949e-06, 6.60943207500949e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:52:42,413] [INFO] [timer.py:199:stop] epoch=6/micro_step=680/global_step=5690, RunningAvgSamplesPerSec=23.790376859435373, CurrSamplesPerSec=23.778457277820156, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:53:09,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=5700, skipped=105, lr=[6.599860451771151e-06, 6.599860451771151e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:53:09,354] [INFO] [timer.py:199:stop] epoch=6/micro_step=720/global_step=5700, RunningAvgSamplesPerSec=23.790383948954616, CurrSamplesPerSec=23.732764057992952, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 19:53:25,246] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:53:27,648] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:53:35,484] [INFO] [logging.py:96:log_dist] [Rank 0] step=5710, skipped=107, lr=[6.592197330313436e-06, 6.592197330313436e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:53:35,726] [INFO] [timer.py:199:stop] epoch=6/micro_step=760/global_step=5710, RunningAvgSamplesPerSec=23.79128035801782, CurrSamplesPerSec=23.810518849958374, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:54:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=5720, skipped=107, lr=[6.58261118655791e-06, 6.58261118655791e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:54:02,643] [INFO] [timer.py:199:stop] epoch=6/micro_step=800/global_step=5720, RunningAvgSamplesPerSec=23.79132290276499, CurrSamplesPerSec=23.81089479529238, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:54:29,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=5730, skipped=107, lr=[6.573017036964921e-06, 6.573017036964921e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:54:29,634] [INFO] [timer.py:199:stop] epoch=6/micro_step=840/global_step=5730, RunningAvgSamplesPerSec=23.7912530729182, CurrSamplesPerSec=23.784506104297055, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:54:56,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=5740, skipped=107, lr=[6.563414925235379e-06, 6.563414925235379e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:54:56,591] [INFO] [timer.py:199:stop] epoch=6/micro_step=880/global_step=5740, RunningAvgSamplesPerSec=23.79124989111836, CurrSamplesPerSec=23.8014997220274, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:55:23,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=5750, 
skipped=107, lr=[6.553804895106455e-06, 6.553804895106455e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:55:23,582] [INFO] [timer.py:199:stop] epoch=6/micro_step=920/global_step=5750, RunningAvgSamplesPerSec=23.79120651983018, CurrSamplesPerSec=23.79617843429062, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:55:50,237] [INFO] [logging.py:96:log_dist] [Rank 0] step=5760, skipped=107, lr=[6.544186990351391e-06, 6.544186990351391e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:55:50,478] [INFO] [timer.py:199:stop] epoch=6/micro_step=960/global_step=5760, RunningAvgSamplesPerSec=23.791291814339385, CurrSamplesPerSec=23.84412620155325, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:56:17,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=5770, skipped=107, lr=[6.5345612547792995e-06, 6.5345612547792995e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:56:17,384] [INFO] [timer.py:199:stop] epoch=6/micro_step=1000/global_step=5770, RunningAvgSamplesPerSec=23.79135666846186, CurrSamplesPerSec=23.86533549349511, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:56:44,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=5780, skipped=107, lr=[6.524927732234955e-06, 6.524927732234955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:56:44,309] [INFO] [timer.py:199:stop] epoch=6/micro_step=1040/global_step=5780, RunningAvgSamplesPerSec=23.791397255757122, CurrSamplesPerSec=23.781859495770682, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:57:11,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=5790, skipped=107, lr=[6.515286466598613e-06, 6.515286466598613e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:57:11,252] [INFO] [timer.py:199:stop] epoch=6/micro_step=1080/global_step=5790, RunningAvgSamplesPerSec=23.791410992741863, CurrSamplesPerSec=23.8140992606273, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:57:38,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=5800, skipped=107, 
lr=[6.505637501785784e-06, 6.505637501785784e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:57:38,271] [INFO] [timer.py:199:stop] epoch=6/micro_step=1120/global_step=5800, RunningAvgSamplesPerSec=23.79132225989106, CurrSamplesPerSec=23.79095651828648, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:57:59,564] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 19:58:01,963] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 19:58:04,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=5810, skipped=109, lr=[6.4979128160620765e-06, 6.4979128160620765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:58:04,659] [INFO] [timer.py:199:stop] epoch=6/micro_step=1160/global_step=5810, RunningAvgSamplesPerSec=23.79217973301812, CurrSamplesPerSec=23.79145625516104, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:58:31,437] [INFO] [logging.py:96:log_dist] [Rank 0] step=5820, skipped=109, lr=[6.4882501035104984e-06, 6.4882501035104984e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:58:31,681] [INFO] [timer.py:199:stop] epoch=6/micro_step=1200/global_step=5820, RunningAvgSamplesPerSec=23.792068751460544, CurrSamplesPerSec=23.73332220501892, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:58:58,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=5830, skipped=109, lr=[6.478579814931817e-06, 6.478579814931817e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:58:58,616] [INFO] [timer.py:199:stop] epoch=6/micro_step=1240/global_step=5830, RunningAvgSamplesPerSec=23.792089528636545, CurrSamplesPerSec=23.829537977526495, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:59:26,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=5840, skipped=109, lr=[6.4689019943737485e-06, 6.4689019943737485e-06], mom=[(0.9, 
0.95), (0.9, 0.95)] [2023-04-23 19:59:26,689] [INFO] [timer.py:199:stop] epoch=6/micro_step=1280/global_step=5840, RunningAvgSamplesPerSec=23.790450702748277, CurrSamplesPerSec=16.88143671482838, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 19:59:53,404] [INFO] [logging.py:96:log_dist] [Rank 0] step=5850, skipped=109, lr=[6.459216685918317e-06, 6.459216685918317e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 19:59:53,647] [INFO] [timer.py:199:stop] epoch=6/micro_step=1320/global_step=5850, RunningAvgSamplesPerSec=23.790442076714253, CurrSamplesPerSec=23.757692218507206, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:00:20,319] [INFO] [logging.py:96:log_dist] [Rank 0] step=5860, skipped=109, lr=[6.4495239336816524e-06, 6.4495239336816524e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:00:20,562] [INFO] [timer.py:199:stop] epoch=6/micro_step=1360/global_step=5860, RunningAvgSamplesPerSec=23.79048722936549, CurrSamplesPerSec=23.766823364730495, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:00:47,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=5870, skipped=109, lr=[6.439823781813792e-06, 6.439823781813792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:00:47,499] [INFO] [timer.py:199:stop] epoch=6/micro_step=1400/global_step=5870, RunningAvgSamplesPerSec=23.790506362493517, CurrSamplesPerSec=23.806950072670784, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:01:14,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=5880, skipped=109, lr=[6.430116274498481e-06, 6.430116274498481e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:01:14,437] [INFO] [timer.py:199:stop] epoch=6/micro_step=1440/global_step=5880, RunningAvgSamplesPerSec=23.79052020205425, CurrSamplesPerSec=23.758838222354946, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:01:41,246] [INFO] [logging.py:96:log_dist] [Rank 0] step=5890, skipped=109, lr=[6.420401455952961e-06, 6.420401455952961e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 20:01:41,489] [INFO] [timer.py:199:stop] epoch=6/micro_step=1480/global_step=5890, RunningAvgSamplesPerSec=23.790370713172223, CurrSamplesPerSec=23.767463081405825, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:02:08,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=5900, skipped=109, lr=[6.41067937042778e-06, 6.41067937042778e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:02:08,488] [INFO] [timer.py:199:stop] epoch=6/micro_step=1520/global_step=5900, RunningAvgSamplesPerSec=23.79029511474756, CurrSamplesPerSec=23.80090881927087, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:02:35,185] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:02:35,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=5910, skipped=110, lr=[6.401923316787266e-06, 6.401923316787266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:02:35,187] [INFO] [timer.py:199:stop] epoch=6/micro_step=1560/global_step=5910, RunningAvgSamplesPerSec=23.79067074286964, CurrSamplesPerSec=26.617868662333272, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:02:37,590] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:03:01,583] [INFO] [logging.py:96:log_dist] [Rank 0] step=5920, skipped=111, lr=[6.393161445068132e-06, 6.393161445068132e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:03:01,825] [INFO] [timer.py:199:stop] epoch=6/micro_step=1600/global_step=5920, RunningAvgSamplesPerSec=23.79113012398796, CurrSamplesPerSec=23.81058009866958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:03:28,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=5930, skipped=111, lr=[6.383419247693801e-06, 6.383419247693801e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:03:28,728] [INFO] [timer.py:199:stop] epoch=6/micro_step=1640/global_step=5930, RunningAvgSamplesPerSec=23.79119431233151, CurrSamplesPerSec=23.839617863390924, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:03:55,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=5940, skipped=111, lr=[6.3736699517920576e-06, 6.3736699517920576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:03:55,689] [INFO] [timer.py:199:stop] epoch=6/micro_step=1680/global_step=5940, RunningAvgSamplesPerSec=23.791174392312392, CurrSamplesPerSec=23.69382791108503, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:04:22,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=5950, skipped=111, lr=[6.36391360177049e-06, 6.36391360177049e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:04:22,622] [INFO] [timer.py:199:stop] epoch=6/micro_step=1720/global_step=5950, RunningAvgSamplesPerSec=23.791198354923544, CurrSamplesPerSec=23.809325618971787, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:04:49,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=5960, skipped=111, lr=[6.354150242068816e-06, 6.354150242068816e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:04:49,554] [INFO] [timer.py:199:stop] epoch=6/micro_step=1760/global_step=5960, RunningAvgSamplesPerSec=23.791223956794273, CurrSamplesPerSec=23.806812833369975, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:05:16,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=5970, skipped=111, lr=[6.344379917158692e-06, 6.344379917158692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:05:16,521] [INFO] [timer.py:199:stop] epoch=6/micro_step=1800/global_step=5970, RunningAvgSamplesPerSec=23.791196819465416, CurrSamplesPerSec=23.814891532351503, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:05:43,205] [INFO] [logging.py:96:log_dist] [Rank 0] step=5980, skipped=111, lr=[6.334602671543494e-06, 6.334602671543494e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:05:43,447] [INFO] [timer.py:199:stop] epoch=6/micro_step=1840/global_step=5980, RunningAvgSamplesPerSec=23.791232296770676, CurrSamplesPerSec=23.776327963698947, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:06:10,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=5990, skipped=111, lr=[6.32481854975812e-06, 6.32481854975812e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:06:10,380] [INFO] [timer.py:199:stop] epoch=6/micro_step=1880/global_step=5990, RunningAvgSamplesPerSec=23.791255603400742, CurrSamplesPerSec=23.82434370117498, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:06:37,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=6000, skipped=111, lr=[6.315027596368792e-06, 6.315027596368792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:06:37,314] [INFO] [timer.py:199:stop] epoch=6/micro_step=1920/global_step=6000, RunningAvgSamplesPerSec=23.791276145452038, CurrSamplesPerSec=23.826705800974405, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:07:04,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=6010, skipped=111, lr=[6.305229855972851e-06, 6.305229855972851e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:07:04,258] [INFO] [timer.py:199:stop] epoch=6/micro_step=1960/global_step=6010, RunningAvgSamplesPerSec=23.791278635615136, CurrSamplesPerSec=23.778118162700924, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 20:07:09,359] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:07:11,771] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:07:30,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=6020, skipped=113, lr=[6.297386807000486e-06, 6.297386807000486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:07:30,629] [INFO] [timer.py:199:stop] epoch=6/micro_step=2000/global_step=6020, RunningAvgSamplesPerSec=23.792123115248508, CurrSamplesPerSec=23.809676184164275, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:07:57,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=6030, skipped=113, lr=[6.287576962476481e-06, 6.287576962476481e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:07:57,581] [INFO] [timer.py:199:stop] epoch=6/micro_step=2040/global_step=6030, RunningAvgSamplesPerSec=23.792117454200962, CurrSamplesPerSec=23.73363276435545, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:08:24,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=6040, skipped=113, lr=[6.277760455982226e-06, 6.277760455982226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:08:24,540] [INFO] [timer.py:199:stop] epoch=6/micro_step=2080/global_step=6040, RunningAvgSamplesPerSec=23.792096487995707, CurrSamplesPerSec=23.796653075820892, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:08:51,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=6050, skipped=113, lr=[6.267937332231452e-06, 6.267937332231452e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:08:51,524] [INFO] [timer.py:199:stop] epoch=6/micro_step=2120/global_step=6050, RunningAvgSamplesPerSec=23.79203947750226, CurrSamplesPerSec=23.67338570422195, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:09:18,294] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=6060, skipped=113, lr=[6.2581076359680326e-06, 6.2581076359680326e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:09:18,538] [INFO] [timer.py:199:stop] epoch=6/micro_step=2160/global_step=6060, RunningAvgSamplesPerSec=23.7919398449718, CurrSamplesPerSec=23.72278269663956, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:09:45,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=6070, skipped=113, lr=[6.248271411965781e-06, 6.248271411965781e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:09:45,528] [INFO] [timer.py:199:stop] epoch=6/micro_step=2200/global_step=6070, RunningAvgSamplesPerSec=23.791878281940598, CurrSamplesPerSec=23.806318784988072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:10:12,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=6080, skipped=113, lr=[6.238428705028236e-06, 6.238428705028236e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:10:12,449] [INFO] [timer.py:199:stop] epoch=6/micro_step=2240/global_step=6080, RunningAvgSamplesPerSec=23.791915401884783, CurrSamplesPerSec=23.847892574281616, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:10:39,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=6090, skipped=113, lr=[6.2285795599884765e-06, 6.2285795599884765e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:10:39,385] [INFO] [timer.py:199:stop] epoch=6/micro_step=2280/global_step=6090, RunningAvgSamplesPerSec=23.79192822898743, CurrSamplesPerSec=23.808474591003325, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:11:06,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=6100, skipped=113, lr=[6.2187240217089e-06, 6.2187240217089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:11:06,278] [INFO] [timer.py:199:stop] epoch=6/micro_step=2320/global_step=6100, RunningAvgSamplesPerSec=23.792001498413274, CurrSamplesPerSec=23.797112968786195, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:11:32,967] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=6110, skipped=113, lr=[6.208862135081026e-06, 6.208862135081026e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:11:33,212] [INFO] [timer.py:199:stop] epoch=6/micro_step=2360/global_step=6110, RunningAvgSamplesPerSec=23.79201536066331, CurrSamplesPerSec=23.717419009174, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:11:43,713] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:11:46,116] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:11:59,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=6120, skipped=115, lr=[6.200968085153518e-06, 6.200968085153518e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:11:59,623] [INFO] [timer.py:199:stop] epoch=6/micro_step=2400/global_step=6120, RunningAvgSamplesPerSec=23.7927874665862, CurrSamplesPerSec=23.721738693962035, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:12:26,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=6130, skipped=115, lr=[6.191094884717509e-06, 6.191094884717509e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:12:26,603] [INFO] [timer.py:199:stop] epoch=6/micro_step=2440/global_step=6130, RunningAvgSamplesPerSec=23.792738460549113, CurrSamplesPerSec=23.759837123223093, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:12:53,339] [INFO] [logging.py:96:log_dist] [Rank 0] step=6140, skipped=115, lr=[6.181215461782642e-06, 6.181215461782642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:12:53,580] [INFO] [timer.py:199:stop] epoch=6/micro_step=2480/global_step=6140, RunningAvgSamplesPerSec=23.79269274366629, CurrSamplesPerSec=23.774754917221376, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:13:20,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=6150, skipped=115, 
lr=[6.171329861349227e-06, 6.171329861349227e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:13:20,542] [INFO] [timer.py:199:stop] epoch=6/micro_step=2520/global_step=6150, RunningAvgSamplesPerSec=23.792668391797633, CurrSamplesPerSec=23.831613360240677, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:13:47,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=6160, skipped=115, lr=[6.161438128445718e-06, 6.161438128445718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:13:47,471] [INFO] [timer.py:199:stop] epoch=6/micro_step=2560/global_step=6160, RunningAvgSamplesPerSec=23.792694749709092, CurrSamplesPerSec=23.79485586645226, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:14:14,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=6170, skipped=115, lr=[6.1515403081284995e-06, 6.1515403081284995e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:14:14,432] [INFO] [timer.py:199:stop] epoch=6/micro_step=2600/global_step=6170, RunningAvgSamplesPerSec=23.792672027834257, CurrSamplesPerSec=23.775203435364258, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:14:41,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=6180, skipped=115, lr=[6.1416364454816845e-06, 6.1416364454816845e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:14:41,371] [INFO] [timer.py:199:stop] epoch=6/micro_step=2640/global_step=6180, RunningAvgSamplesPerSec=23.792681942190466, CurrSamplesPerSec=23.779805408730518, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:15:08,092] [INFO] [logging.py:96:log_dist] [Rank 0] step=6190, skipped=115, lr=[6.131726585616906e-06, 6.131726585616906e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:15:08,336] [INFO] [timer.py:199:stop] epoch=6/micro_step=2680/global_step=6190, RunningAvgSamplesPerSec=23.792656197205815, CurrSamplesPerSec=23.686932584722765, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:15:35,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=6200, skipped=115, 
lr=[6.121810773673119e-06, 6.121810773673119e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:15:35,274] [INFO] [timer.py:199:stop] epoch=6/micro_step=2720/global_step=6200, RunningAvgSamplesPerSec=23.79266909042901, CurrSamplesPerSec=23.82808055987528, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:16:02,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=6210, skipped=115, lr=[6.111889054816386e-06, 6.111889054816386e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:16:02,244] [INFO] [timer.py:199:stop] epoch=6/micro_step=2760/global_step=6210, RunningAvgSamplesPerSec=23.792638310494993, CurrSamplesPerSec=23.785024537693133, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:16:18,146] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:16:20,556] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:16:28,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=6220, skipped=117, lr=[6.103947457122454e-06, 6.103947457122454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:16:28,682] [INFO] [timer.py:199:stop] epoch=6/micro_step=2800/global_step=6220, RunningAvgSamplesPerSec=23.793362388683345, CurrSamplesPerSec=23.774978121318032, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:16:55,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=6230, skipped=117, lr=[6.09401521972665e-06, 6.09401521972665e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:16:55,672] [INFO] [timer.py:199:stop] epoch=6/micro_step=2840/global_step=6230, RunningAvgSamplesPerSec=23.79329752566771, CurrSamplesPerSec=23.80244522746865, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:17:22,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=6240, skipped=117, lr=[6.0840772020253585e-06, 6.0840772020253585e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 20:17:22,612] [INFO] [timer.py:199:stop] epoch=6/micro_step=2880/global_step=6240, RunningAvgSamplesPerSec=23.793305598984116, CurrSamplesPerSec=23.8129479196939, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:17:49,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=6250, skipped=117, lr=[6.074133449285791e-06, 6.074133449285791e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:17:49,535] [INFO] [timer.py:199:stop] epoch=6/micro_step=2920/global_step=6250, RunningAvgSamplesPerSec=23.79333745676212, CurrSamplesPerSec=23.89466869441874, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:18:16,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=6260, skipped=117, lr=[6.0641840068012806e-06, 6.0641840068012806e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:18:16,457] [INFO] [timer.py:199:stop] epoch=6/micro_step=2960/global_step=6260, RunningAvgSamplesPerSec=23.793367916351983, CurrSamplesPerSec=23.821459909980263, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:18:43,146] [INFO] [logging.py:96:log_dist] [Rank 0] step=6270, skipped=117, lr=[6.054228919891073e-06, 6.054228919891073e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:18:43,390] [INFO] [timer.py:199:stop] epoch=6/micro_step=3000/global_step=6270, RunningAvgSamplesPerSec=23.79338490523288, CurrSamplesPerSec=23.801265467781608, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:19:10,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=6280, skipped=117, lr=[6.044268233900129e-06, 6.044268233900129e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:19:10,336] [INFO] [timer.py:199:stop] epoch=6/micro_step=3040/global_step=6280, RunningAvgSamplesPerSec=23.793386694177567, CurrSamplesPerSec=23.784752673459785, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:19:37,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=6290, skipped=117, lr=[6.034301994198915e-06, 6.034301994198915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 20:19:37,328] [INFO] [timer.py:199:stop] epoch=6/micro_step=3080/global_step=6290, RunningAvgSamplesPerSec=23.793326322685093, CurrSamplesPerSec=23.77637218889728, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:20:03,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=6300, skipped=117, lr=[6.024330246183186e-06, 6.024330246183186e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:20:04,235] [INFO] [timer.py:199:stop] epoch=6/micro_step=3120/global_step=6300, RunningAvgSamplesPerSec=23.793379491454925, CurrSamplesPerSec=23.838066068188823, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:20:31,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=6310, skipped=117, lr=[6.014353035273795e-06, 6.014353035273795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:20:31,281] [INFO] [timer.py:199:stop] epoch=6/micro_step=3160/global_step=6310, RunningAvgSamplesPerSec=23.793248576960334, CurrSamplesPerSec=23.78086295589507, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:20:52,530] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:20:54,936] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:20:57,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=6320, skipped=119, lr=[6.006367363801546e-06, 6.006367363801546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:20:57,640] [INFO] [timer.py:199:stop] epoch=6/micro_step=3200/global_step=6320, RunningAvgSamplesPerSec=23.794065855933596, CurrSamplesPerSec=23.71227558575674, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:21:24,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=6330, skipped=119, lr=[5.996380434223397e-06, 5.996380434223397e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:21:24,589] [INFO] [timer.py:199:stop] epoch=6/micro_step=3240/global_step=6330, RunningAvgSamplesPerSec=23.794061875904355, CurrSamplesPerSec=23.595020877604778, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:21:51,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=6340, skipped=119, lr=[5.986388169061683e-06, 5.986388169061683e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:21:51,536] [INFO] [timer.py:199:stop] epoch=6/micro_step=3280/global_step=6340, RunningAvgSamplesPerSec=23.79406293427073, CurrSamplesPerSec=23.82842956238176, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:22:18,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=6350, skipped=119, lr=[5.976390613830709e-06, 5.976390613830709e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:22:18,477] [INFO] [timer.py:199:stop] epoch=6/micro_step=3320/global_step=6350, RunningAvgSamplesPerSec=23.794066987495775, CurrSamplesPerSec=23.75754082810114, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:22:45,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=6360, skipped=119, lr=[5.966387814068873e-06, 5.966387814068873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:22:45,420] [INFO] [timer.py:199:stop] epoch=6/micro_step=3360/global_step=6360, RunningAvgSamplesPerSec=23.79406816154406, CurrSamplesPerSec=23.775763580376708, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 20:23:12,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=6370, skipped=119, lr=[5.9563798153384655e-06, 5.9563798153384655e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:23:12,356] [INFO] [timer.py:199:stop] epoch=6/micro_step=3400/global_step=6370, RunningAvgSamplesPerSec=23.79407756528809, CurrSamplesPerSec=23.75544258756406, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:23:39,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=6380, skipped=119, lr=[5.946366663225457e-06, 5.946366663225457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:23:39,349] [INFO] [timer.py:199:stop] epoch=6/micro_step=3440/global_step=6380, RunningAvgSamplesPerSec=23.79400688212201, CurrSamplesPerSec=23.780437397718426, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:24:06,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=6390, skipped=119, lr=[5.9363484033392915e-06, 5.9363484033392915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:24:06,323] [INFO] [timer.py:199:stop] epoch=6/micro_step=3480/global_step=6390, RunningAvgSamplesPerSec=23.7939637142524, CurrSamplesPerSec=23.874203479356744, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:24:33,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=6400, skipped=119, lr=[5.9263250813126764e-06, 5.9263250813126764e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:24:33,260] [INFO] [timer.py:199:stop] epoch=6/micro_step=3520/global_step=6400, RunningAvgSamplesPerSec=23.793971925402293, CurrSamplesPerSec=23.739421566874334, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:24:59,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=6410, skipped=119, lr=[5.916296742801381e-06, 5.916296742801381e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:25:00,242] [INFO] [timer.py:199:stop] epoch=6/micro_step=3560/global_step=6410, RunningAvgSamplesPerSec=23.793918297884538, CurrSamplesPerSec=23.75778473581629, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB
[2023-04-23 20:25:26,936] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-23 20:25:26,937] [INFO] [logging.py:96:log_dist] [Rank 0] step=6420, skipped=120, lr=[5.9072669867997216e-06, 5.9072669867997216e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 20:25:26,937] [INFO] [timer.py:199:stop] epoch=6/micro_step=3600/global_step=6420, RunningAvgSamplesPerSec=23.794265197241963, CurrSamplesPerSec=26.7153626186196, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 20:25:29,344] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072
[2023-04-23 20:25:53,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=6430, skipped=121, lr=[5.898233237760258e-06, 5.898233237760258e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 20:25:53,609] [INFO] [timer.py:199:stop] epoch=6/micro_step=3640/global_step=6430, RunningAvgSamplesPerSec=23.794637441223696, CurrSamplesPerSec=23.826898257670734, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 20:26:20,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=6440, skipped=121, lr=[5.888191096174236e-06, 5.888191096174236e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 20:26:20,613] [INFO] [timer.py:199:stop] epoch=6/micro_step=3680/global_step=6440, RunningAvgSamplesPerSec=23.79455994566676, CurrSamplesPerSec=23.639691329659158, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 7/16 *****
ppl: 1.8117164373397827
saving the final model ...
Beginning of Epoch 8/16, Total Micro Batches 3680 [2023-04-23 20:27:39,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=6450, skipped=121, lr=[5.8781441118020675e-06, 5.8781441118020675e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:27:39,307] [INFO] [timer.py:199:stop] epoch=7/micro_step=40/global_step=6450, RunningAvgSamplesPerSec=23.79439711693707, CurrSamplesPerSec=23.73117369278307, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:28:06,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=6460, skipped=121, lr=[5.868092330407301e-06, 5.868092330407301e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:28:06,302] [INFO] [timer.py:199:stop] epoch=7/micro_step=80/global_step=6460, RunningAvgSamplesPerSec=23.794330731814064, CurrSamplesPerSec=23.709247142702072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:28:33,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=6470, skipped=121, lr=[5.858035797775332e-06, 5.858035797775332e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:28:33,279] [INFO] [timer.py:199:stop] epoch=7/micro_step=120/global_step=6470, RunningAvgSamplesPerSec=23.79428725100643, CurrSamplesPerSec=23.732946606950765, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:29:00,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=6480, skipped=121, lr=[5.847974559713202e-06, 5.847974559713202e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:29:00,294] [INFO] [timer.py:199:stop] epoch=7/micro_step=160/global_step=6480, RunningAvgSamplesPerSec=23.794187797824335, CurrSamplesPerSec=23.78111366313842, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:29:26,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=6490, skipped=121, lr=[5.8379086620493845e-06, 5.8379086620493845e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:29:27,232] [INFO] [timer.py:199:stop] epoch=7/micro_step=200/global_step=6490, RunningAvgSamplesPerSec=23.79419248465195, CurrSamplesPerSec=23.887105443191437, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 20:29:53,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=6500, skipped=121, lr=[5.827838150633576e-06, 5.827838150633576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:29:54,224] [INFO] [timer.py:199:stop] epoch=7/micro_step=240/global_step=6500, RunningAvgSamplesPerSec=23.794127858006583, CurrSamplesPerSec=23.681827445094797, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:30:20,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=6510, skipped=121, lr=[5.817763071336488e-06, 5.817763071336488e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:30:21,187] [INFO] [timer.py:199:stop] epoch=7/micro_step=280/global_step=6510, RunningAvgSamplesPerSec=23.794098673224852, CurrSamplesPerSec=23.806975409330107, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:30:47,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=6520, skipped=121, lr=[5.807683470049641e-06, 5.807683470049641e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:30:48,097] [INFO] [timer.py:199:stop] epoch=7/micro_step=320/global_step=6520, RunningAvgSamplesPerSec=23.79414456014495, CurrSamplesPerSec=23.76111373564154, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:30:53,198] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:30:55,609] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:31:14,277] [INFO] [logging.py:96:log_dist] [Rank 0] step=6530, skipped=123, lr=[5.799616564039792e-06, 5.799616564039792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:31:14,521] [INFO] [timer.py:199:stop] epoch=7/micro_step=360/global_step=6530, RunningAvgSamplesPerSec=23.794850756842013, CurrSamplesPerSec=23.631364891123887, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:31:41,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=6540, skipped=123, lr=[5.7895289388836205e-06, 5.7895289388836205e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:31:41,513] [INFO] [timer.py:199:stop] epoch=7/micro_step=400/global_step=6540, RunningAvgSamplesPerSec=23.794787190810023, CurrSamplesPerSec=23.769355076646043, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:32:08,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=6550, skipped=123, lr=[5.779436920342852e-06, 5.779436920342852e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:32:08,466] [INFO] [timer.py:199:stop] epoch=7/micro_step=440/global_step=6550, RunningAvgSamplesPerSec=23.794771767998043, CurrSamplesPerSec=23.6541771953799, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:32:35,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=6560, skipped=123, lr=[5.769340554386167e-06, 5.769340554386167e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:32:35,426] [INFO] [timer.py:199:stop] epoch=7/micro_step=480/global_step=6560, RunningAvgSamplesPerSec=23.79474742434493, CurrSamplesPerSec=23.775504562673085, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:33:02,128] [INFO] [logging.py:96:log_dist] [Rank 0] step=6570, skipped=123, lr=[5.759239887002041e-06, 5.759239887002041e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:33:02,374] [INFO] [timer.py:199:stop] epoch=7/micro_step=520/global_step=6570, RunningAvgSamplesPerSec=23.794744336342273, CurrSamplesPerSec=23.756893235365908, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 20:33:29,079] [INFO] [logging.py:96:log_dist] [Rank 0] step=6580, skipped=123, lr=[5.749134964198547e-06, 5.749134964198547e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:33:29,322] [INFO] [timer.py:199:stop] epoch=7/micro_step=560/global_step=6580, RunningAvgSamplesPerSec=23.79473592343417, CurrSamplesPerSec=23.727085460408624, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:33:56,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=6590, skipped=123, lr=[5.739025832003138e-06, 5.739025832003138e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:33:56,276] [INFO] [timer.py:199:stop] epoch=7/micro_step=600/global_step=6590, RunningAvgSamplesPerSec=23.794719278810668, CurrSamplesPerSec=23.801297123491313, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:34:22,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=6600, skipped=123, lr=[5.728912536462445e-06, 5.728912536462445e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:34:23,221] [INFO] [timer.py:199:stop] epoch=7/micro_step=640/global_step=6600, RunningAvgSamplesPerSec=23.794717727929218, CurrSamplesPerSec=23.76329712387886, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:34:49,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=6610, skipped=123, lr=[5.71879512364206e-06, 5.71879512364206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:34:50,160] [INFO] [timer.py:199:stop] epoch=7/micro_step=680/global_step=6610, RunningAvgSamplesPerSec=23.794726392039944, CurrSamplesPerSec=23.74684325473494, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:35:16,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=6620, skipped=123, lr=[5.708673639626328e-06, 5.708673639626328e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:35:17,168] [INFO] [timer.py:199:stop] epoch=7/micro_step=720/global_step=6620, RunningAvgSamplesPerSec=23.794638382007598, CurrSamplesPerSec=23.70157059556899, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 20:35:27,693] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:35:30,097] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:35:43,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=6630, skipped=125, lr=[5.700573552133617e-06, 5.700573552133617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:35:43,541] [INFO] [timer.py:199:stop] epoch=7/micro_step=760/global_step=6630, RunningAvgSamplesPerSec=23.795400372503874, CurrSamplesPerSec=23.86711153448991, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:36:10,215] [INFO] [logging.py:96:log_dist] [Rank 0] step=6640, skipped=125, lr=[5.690444856157873e-06, 5.690444856157873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:36:10,458] [INFO] [timer.py:199:stop] epoch=7/micro_step=800/global_step=6640, RunningAvgSamplesPerSec=23.795432423657786, CurrSamplesPerSec=23.754879195315066, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:36:37,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=6650, skipped=125, lr=[5.6803122181209365e-06, 5.6803122181209365e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:36:37,436] [INFO] [timer.py:199:stop] epoch=7/micro_step=840/global_step=6650, RunningAvgSamplesPerSec=23.795382806779315, CurrSamplesPerSec=23.74612062285172, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:37:04,109] [INFO] [logging.py:96:log_dist] [Rank 0] step=6660, skipped=125, lr=[5.6701756841765045e-06, 5.6701756841765045e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:37:04,351] [INFO] [timer.py:199:stop] epoch=7/micro_step=880/global_step=6660, RunningAvgSamplesPerSec=23.795418041659406, CurrSamplesPerSec=23.851821206764065, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:37:31,061] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=6670, skipped=125, lr=[5.660035300496018e-06, 5.660035300496018e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:37:31,305] [INFO] [timer.py:199:stop] epoch=7/micro_step=920/global_step=6670, RunningAvgSamplesPerSec=23.79540327937113, CurrSamplesPerSec=23.704509160706294, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:37:58,069] [INFO] [logging.py:96:log_dist] [Rank 0] step=6680, skipped=125, lr=[5.649891113268454e-06, 5.649891113268454e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:37:58,311] [INFO] [timer.py:199:stop] epoch=7/micro_step=960/global_step=6680, RunningAvgSamplesPerSec=23.795319764567576, CurrSamplesPerSec=23.693737982576987, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:38:25,020] [INFO] [logging.py:96:log_dist] [Rank 0] step=6690, skipped=125, lr=[5.639743168700117e-06, 5.639743168700117e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:38:25,260] [INFO] [timer.py:199:stop] epoch=7/micro_step=1000/global_step=6690, RunningAvgSamplesPerSec=23.795309975103265, CurrSamplesPerSec=23.691770183114784, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:38:52,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=6700, skipped=125, lr=[5.629591513014424e-06, 5.629591513014424e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:38:52,271] [INFO] [timer.py:199:stop] epoch=7/micro_step=1040/global_step=6700, RunningAvgSamplesPerSec=23.79521967836017, CurrSamplesPerSec=23.77427062085782, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:39:18,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=6710, skipped=125, lr=[5.619436192451693e-06, 5.619436192451693e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:39:19,226] [INFO] [timer.py:199:stop] epoch=7/micro_step=1080/global_step=6710, RunningAvgSamplesPerSec=23.795207189558912, CurrSamplesPerSec=23.808918047117693, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:39:45,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=6720, skipped=125, 
lr=[5.60927725326894e-06, 5.60927725326894e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:39:46,149] [INFO] [timer.py:199:stop] epoch=7/micro_step=1120/global_step=6720, RunningAvgSamplesPerSec=23.79523590707735, CurrSamplesPerSec=23.723168455817767, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:40:02,084] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:40:04,496] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:40:12,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=6730, skipped=127, lr=[5.601147527611565e-06, 5.601147527611565e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:40:12,587] [INFO] [timer.py:199:stop] epoch=7/micro_step=1160/global_step=6730, RunningAvgSamplesPerSec=23.795899712713684, CurrSamplesPerSec=23.738592322923896, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:40:39,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=6740, skipped=127, lr=[5.590982191532927e-06, 5.590982191532927e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:40:39,740] [INFO] [timer.py:199:stop] epoch=7/micro_step=1200/global_step=6740, RunningAvgSamplesPerSec=23.79569112310567, CurrSamplesPerSec=23.69986723117403, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:41:06,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=6750, skipped=127, lr=[5.5808133664409226e-06, 5.5808133664409226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:41:06,750] [INFO] [timer.py:199:stop] epoch=7/micro_step=1240/global_step=6750, RunningAvgSamplesPerSec=23.795607641577174, CurrSamplesPerSec=23.656793375069192, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:41:33,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=6760, skipped=127, lr=[5.570641098654079e-06, 5.570641098654079e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 20:41:33,802] [INFO] [timer.py:199:stop] epoch=7/micro_step=1280/global_step=6760, RunningAvgSamplesPerSec=23.795473158597108, CurrSamplesPerSec=23.7344553661962, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:42:00,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=6770, skipped=127, lr=[5.560465434506603e-06, 5.560465434506603e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:42:00,896] [INFO] [timer.py:199:stop] epoch=7/micro_step=1320/global_step=6770, RunningAvgSamplesPerSec=23.7952794219975, CurrSamplesPerSec=23.73357191087223, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:42:27,638] [INFO] [logging.py:96:log_dist] [Rank 0] step=6780, skipped=127, lr=[5.550286420348174e-06, 5.550286420348174e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:42:27,880] [INFO] [timer.py:199:stop] epoch=7/micro_step=1360/global_step=6780, RunningAvgSamplesPerSec=23.795227688756295, CurrSamplesPerSec=23.65461700663925, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:42:54,618] [INFO] [logging.py:96:log_dist] [Rank 0] step=6790, skipped=127, lr=[5.540104102543729e-06, 5.540104102543729e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:42:54,860] [INFO] [timer.py:199:stop] epoch=7/micro_step=1400/global_step=6790, RunningAvgSamplesPerSec=23.79517708885778, CurrSamplesPerSec=23.723887594734983, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:43:21,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=6800, skipped=127, lr=[5.5299185274732536e-06, 5.5299185274732536e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:43:21,828] [INFO] [timer.py:199:stop] epoch=7/micro_step=1440/global_step=6800, RunningAvgSamplesPerSec=23.795144867986522, CurrSamplesPerSec=23.804612997783465, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:43:48,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=6810, skipped=127, lr=[5.5197297415315674e-06, 5.5197297415315674e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 20:43:48,902] [INFO] [timer.py:199:stop] epoch=7/micro_step=1480/global_step=6810, RunningAvgSamplesPerSec=23.795020884751104, CurrSamplesPerSec=23.729361188135897, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:44:15,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=6820, skipped=127, lr=[5.509537791128122e-06, 5.509537791128122e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:44:15,911] [INFO] [timer.py:199:stop] epoch=7/micro_step=1520/global_step=6820, RunningAvgSamplesPerSec=23.79493470308955, CurrSamplesPerSec=23.716115657866535, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:44:37,181] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:44:39,584] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:44:42,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=6830, skipped=129, lr=[5.501381983589255e-06, 5.501381983589255e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:44:42,281] [INFO] [timer.py:199:stop] epoch=7/micro_step=1560/global_step=6830, RunningAvgSamplesPerSec=23.795676182649405, CurrSamplesPerSec=23.783728496576934, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:45:09,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=6840, skipped=129, lr=[5.491184454152322e-06, 5.491184454152322e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:45:09,265] [INFO] [timer.py:199:stop] epoch=7/micro_step=1600/global_step=6840, RunningAvgSamplesPerSec=23.795622218186164, CurrSamplesPerSec=23.715277565501268, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:45:35,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=6850, skipped=129, lr=[5.480983890276089e-06, 5.480983890276089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:45:36,239] [INFO] [timer.py:199:stop] 
epoch=7/micro_step=1640/global_step=6850, RunningAvgSamplesPerSec=23.795583336406118, CurrSamplesPerSec=23.767926055742237, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:46:02,944] [INFO] [logging.py:96:log_dist] [Rank 0] step=6860, skipped=129, lr=[5.470780338423652e-06, 5.470780338423652e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:46:03,187] [INFO] [timer.py:199:stop] epoch=7/micro_step=1680/global_step=6860, RunningAvgSamplesPerSec=23.795577411574648, CurrSamplesPerSec=23.720654957068103, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:46:29,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=6870, skipped=129, lr=[5.460573845071715e-06, 5.460573845071715e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:46:30,194] [INFO] [timer.py:199:stop] epoch=7/micro_step=1720/global_step=6870, RunningAvgSamplesPerSec=23.795500158242167, CurrSamplesPerSec=23.75820107262429, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:46:56,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=6880, skipped=129, lr=[5.450364456710384e-06, 5.450364456710384e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:46:57,187] [INFO] [timer.py:199:stop] epoch=7/micro_step=1760/global_step=6880, RunningAvgSamplesPerSec=23.795436849978877, CurrSamplesPerSec=23.746366396783518, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:47:23,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=6890, skipped=129, lr=[5.4401522198429465e-06, 5.4401522198429465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:47:24,166] [INFO] [timer.py:199:stop] epoch=7/micro_step=1800/global_step=6890, RunningAvgSamplesPerSec=23.795390842780876, CurrSamplesPerSec=23.7796769084125, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:47:50,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=6900, skipped=129, lr=[5.429937180985671e-06, 5.429937180985671e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:47:51,178] [INFO] [timer.py:199:stop] 
epoch=7/micro_step=1840/global_step=6900, RunningAvgSamplesPerSec=23.79530617751703, CurrSamplesPerSec=23.73424551337006, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:48:17,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=6910, skipped=129, lr=[5.419719386667584e-06, 5.419719386667584e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:48:18,174] [INFO] [timer.py:199:stop] epoch=7/micro_step=1880/global_step=6910, RunningAvgSamplesPerSec=23.79523695488304, CurrSamplesPerSec=23.678353530929733, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:48:44,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=6920, skipped=129, lr=[5.409498883430266e-06, 5.409498883430266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:48:45,126] [INFO] [timer.py:199:stop] epoch=7/micro_step=1920/global_step=6920, RunningAvgSamplesPerSec=23.79523229912054, CurrSamplesPerSec=23.801214818821226, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:49:11,854] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:49:11,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=6930, skipped=130, lr=[5.400298152867298e-06, 5.400298152867298e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:49:11,856] [INFO] [timer.py:199:stop] epoch=7/micro_step=1960/global_step=6930, RunningAvgSamplesPerSec=23.79550871161303, CurrSamplesPerSec=26.621812523491837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:49:14,259] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:49:38,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=6940, skipped=131, lr=[5.391095299734483e-06, 5.391095299734483e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:49:38,648] [INFO] [timer.py:199:stop] epoch=7/micro_step=2000/global_step=6940, RunningAvgSamplesPerSec=23.79572429147884, CurrSamplesPerSec=23.759012761082282, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:50:05,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=6950, skipped=131, lr=[5.380867459228737e-06, 5.380867459228737e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:50:05,553] [INFO] [timer.py:199:stop] epoch=7/micro_step=2040/global_step=6950, RunningAvgSamplesPerSec=23.795775028771708, CurrSamplesPerSec=23.876889796557343, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:50:33,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=6960, skipped=131, lr=[5.370637086772486e-06, 5.370637086772486e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:50:33,572] [INFO] [timer.py:199:stop] epoch=7/micro_step=2080/global_step=6960, RunningAvgSamplesPerSec=23.794986483107415, CurrSamplesPerSec=23.83429856595786, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:51:00,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=6970, skipped=131, lr=[5.3604042289646046e-06, 5.3604042289646046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:51:00,464] [INFO] [timer.py:199:stop] epoch=7/micro_step=2120/global_step=6970, RunningAvgSamplesPerSec=23.795052950778054, CurrSamplesPerSec=23.88312480620879, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:51:27,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=6980, skipped=131, lr=[5.3501689324152854e-06, 5.3501689324152854e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:51:27,325] [INFO] [timer.py:199:stop] epoch=7/micro_step=2160/global_step=6980, RunningAvgSamplesPerSec=23.79515839499605, CurrSamplesPerSec=23.887073558820422, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:51:53,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=6990, skipped=131, lr=[5.339931243745829e-06, 5.339931243745829e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:51:54,205] [INFO] [timer.py:199:stop] epoch=7/micro_step=2200/global_step=6990, RunningAvgSamplesPerSec=23.795243504121256, CurrSamplesPerSec=23.872674778213103, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:52:20,829] [INFO] [logging.py:96:log_dist] [Rank 0] step=7000, skipped=131, lr=[5.329691209588432e-06, 5.329691209588432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:52:21,071] [INFO] [timer.py:199:stop] epoch=7/micro_step=2240/global_step=7000, RunningAvgSamplesPerSec=23.795338151942893, CurrSamplesPerSec=23.935087770607314, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:52:47,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=7010, skipped=131, lr=[5.319448876585975e-06, 5.319448876585975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:52:47,961] [INFO] [timer.py:199:stop] epoch=7/micro_step=2280/global_step=7010, RunningAvgSamplesPerSec=23.79540556315959, CurrSamplesPerSec=23.72233825053189, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:53:14,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=7020, skipped=131, lr=[5.3092042913918115e-06, 5.3092042913918115e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:53:14,821] [INFO] [timer.py:199:stop] epoch=7/micro_step=2320/global_step=7020, RunningAvgSamplesPerSec=23.795508344105357, CurrSamplesPerSec=23.838542382198405, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:53:41,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=7030, skipped=131, lr=[5.298957500669552e-06, 5.298957500669552e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:53:41,690] [INFO] [timer.py:199:stop] epoch=7/micro_step=2360/global_step=7030, RunningAvgSamplesPerSec=23.795602860261866, CurrSamplesPerSec=23.80075054672874, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:53:46,779] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:53:49,180] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:54:07,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=7040, skipped=133, lr=[5.29075851147588e-06, 5.29075851147588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:54:08,001] [INFO] [timer.py:199:stop] epoch=7/micro_step=2400/global_step=7040, RunningAvgSamplesPerSec=23.796397654549697, CurrSamplesPerSec=23.84017046134761, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:54:34,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=7050, skipped=133, lr=[5.2805078684272754e-06, 5.2805078684272754e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:54:34,883] [INFO] [timer.py:199:stop] epoch=7/micro_step=2440/global_step=7050, RunningAvgSamplesPerSec=23.796471063610955, CurrSamplesPerSec=23.84042453797072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:55:01,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=7060, skipped=133, lr=[5.27025515056145e-06, 5.27025515056145e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:55:01,740] [INFO] [timer.py:199:stop] epoch=7/micro_step=2480/global_step=7060, RunningAvgSamplesPerSec=23.79657453845888, CurrSamplesPerSec=23.874139779566917, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:55:28,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=7070, skipped=133, lr=[5.2600004045790595e-06, 5.2600004045790595e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:55:28,637] [INFO] [timer.py:199:stop] epoch=7/micro_step=2520/global_step=7070, RunningAvgSamplesPerSec=23.796626009432945, CurrSamplesPerSec=23.853737255398194, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:55:55,289] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=7080, skipped=133, lr=[5.249743677189995e-06, 5.249743677189995e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:55:55,532] [INFO] [timer.py:199:stop] epoch=7/micro_step=2560/global_step=7080, RunningAvgSamplesPerSec=23.796687942624636, CurrSamplesPerSec=23.816074756272364, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:56:22,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=7090, skipped=133, lr=[5.239485015113176e-06, 5.239485015113176e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:56:22,444] [INFO] [timer.py:199:stop] epoch=7/micro_step=2600/global_step=7090, RunningAvgSamplesPerSec=23.79672528739751, CurrSamplesPerSec=23.857655091603426, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:56:49,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=7100, skipped=133, lr=[5.22922446507633e-06, 5.22922446507633e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:56:49,344] [INFO] [timer.py:199:stop] epoch=7/micro_step=2640/global_step=7100, RunningAvgSamplesPerSec=23.796776367728707, CurrSamplesPerSec=23.768205953521054, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:57:15,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=7110, skipped=133, lr=[5.21896207381579e-06, 5.21896207381579e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:57:16,222] [INFO] [timer.py:199:stop] epoch=7/micro_step=2680/global_step=7110, RunningAvgSamplesPerSec=23.796854319692393, CurrSamplesPerSec=23.843054549149553, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:57:42,852] [INFO] [logging.py:96:log_dist] [Rank 0] step=7120, skipped=133, lr=[5.208697888076266e-06, 5.208697888076266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:57:43,095] [INFO] [timer.py:199:stop] epoch=7/micro_step=2720/global_step=7120, RunningAvgSamplesPerSec=23.796934515434913, CurrSamplesPerSec=23.86416434227751, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:58:09,725] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=7130, skipped=133, lr=[5.1984319546106546e-06, 5.1984319546106546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:58:09,966] [INFO] [timer.py:199:stop] epoch=7/micro_step=2760/global_step=7130, RunningAvgSamplesPerSec=23.797017882579198, CurrSamplesPerSec=23.83373142719446, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:58:20,439] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 20:58:22,841] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 20:58:36,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=7140, skipped=135, lr=[5.190217980898411e-06, 5.190217980898411e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:58:36,289] [INFO] [timer.py:199:stop] epoch=7/micro_step=2800/global_step=7140, RunningAvgSamplesPerSec=23.79778067487763, CurrSamplesPerSec=23.857309473131497, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:59:02,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=7150, skipped=135, lr=[5.179949019368396e-06, 5.179949019368396e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:59:03,149] [INFO] [timer.py:199:stop] epoch=7/micro_step=2840/global_step=7150, RunningAvgSamplesPerSec=23.797876469835234, CurrSamplesPerSec=23.887615604703992, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:59:29,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=7160, skipped=135, lr=[5.169678441062055e-06, 5.169678441062055e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:59:30,016] [INFO] [timer.py:199:stop] epoch=7/micro_step=2880/global_step=7160, RunningAvgSamplesPerSec=23.797961243794642, CurrSamplesPerSec=23.856919339158193, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 20:59:56,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=7170, skipped=135, 
lr=[5.1594062927613975e-06, 5.1594062927613975e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 20:59:56,877] [INFO] [timer.py:199:stop] epoch=7/micro_step=2920/global_step=7170, RunningAvgSamplesPerSec=23.798053701546706, CurrSamplesPerSec=23.906421749049212, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:00:23,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=7180, skipped=135, lr=[5.14913262125558e-06, 5.14913262125558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:00:23,726] [INFO] [timer.py:199:stop] epoch=7/micro_step=2960/global_step=7180, RunningAvgSamplesPerSec=23.79815974559728, CurrSamplesPerSec=23.910859547777456, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:00:50,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=7190, skipped=135, lr=[5.138857473340704e-06, 5.138857473340704e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:00:50,591] [INFO] [timer.py:199:stop] epoch=7/micro_step=3000/global_step=7190, RunningAvgSamplesPerSec=23.798252255873305, CurrSamplesPerSec=23.852295951189934, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:01:17,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=7200, skipped=135, lr=[5.128580895819588e-06, 5.128580895819588e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:01:17,442] [INFO] [timer.py:199:stop] epoch=7/micro_step=3040/global_step=7200, RunningAvgSamplesPerSec=23.79835799411396, CurrSamplesPerSec=23.836285880792932, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:01:44,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=7210, skipped=135, lr=[5.118302935501566e-06, 5.118302935501566e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:01:44,377] [INFO] [timer.py:199:stop] epoch=7/micro_step=3080/global_step=7210, RunningAvgSamplesPerSec=23.798362116282146, CurrSamplesPerSec=23.848788798804378, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:02:10,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=7220, skipped=135, 
lr=[5.108023639202274e-06, 5.108023639202274e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:02:11,239] [INFO] [timer.py:199:stop] epoch=7/micro_step=3120/global_step=7220, RunningAvgSamplesPerSec=23.798456359592638, CurrSamplesPerSec=23.872507057645024, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:02:38,691] [INFO] [logging.py:96:log_dist] [Rank 0] step=7230, skipped=135, lr=[5.097743053743426e-06, 5.097743053743426e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:02:38,932] [INFO] [timer.py:199:stop] epoch=7/micro_step=3160/global_step=7230, RunningAvgSamplesPerSec=23.79757246067692, CurrSamplesPerSec=23.80989582062908, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:02:54,779] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:02:57,178] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:03:04,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=7240, skipped=137, lr=[5.089517688649419e-06, 5.089517688649419e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:03:05,238] [INFO] [timer.py:199:stop] epoch=7/micro_step=3200/global_step=7240, RunningAvgSamplesPerSec=23.798345318579003, CurrSamplesPerSec=23.83516413945778, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:03:31,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=7250, skipped=137, lr=[5.0792349007127126e-06, 5.0792349007127126e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:03:32,137] [INFO] [timer.py:199:stop] epoch=7/micro_step=3240/global_step=7250, RunningAvgSamplesPerSec=23.798396414616942, CurrSamplesPerSec=23.820157781489744, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:03:58,765] [INFO] [logging.py:96:log_dist] [Rank 0] step=7260, skipped=137, lr=[5.068950954747821e-06, 5.068950954747821e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 21:03:59,008] [INFO] [timer.py:199:stop] epoch=7/micro_step=3280/global_step=7260, RunningAvgSamplesPerSec=23.798479774102145, CurrSamplesPerSec=23.81605573924402, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:04:25,657] [INFO] [logging.py:96:log_dist] [Rank 0] step=7270, skipped=137, lr=[5.058665897597642e-06, 5.058665897597642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:04:25,901] [INFO] [timer.py:199:stop] epoch=7/micro_step=3320/global_step=7270, RunningAvgSamplesPerSec=23.798539731625915, CurrSamplesPerSec=23.64301230322368, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:04:52,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=7280, skipped=137, lr=[5.048379776110132e-06, 5.048379776110132e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:04:52,779] [INFO] [timer.py:199:stop] epoch=7/micro_step=3360/global_step=7280, RunningAvgSamplesPerSec=23.798610610825698, CurrSamplesPerSec=23.901083242068896, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:05:19,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=7290, skipped=137, lr=[5.038092637138101e-06, 5.038092637138101e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:05:19,636] [INFO] [timer.py:199:stop] epoch=7/micro_step=3400/global_step=7290, RunningAvgSamplesPerSec=23.798704251467175, CurrSamplesPerSec=23.888861340628857, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:05:46,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=7300, skipped=137, lr=[5.027804527538988e-06, 5.027804527538988e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:05:46,517] [INFO] [timer.py:199:stop] epoch=7/micro_step=3440/global_step=7300, RunningAvgSamplesPerSec=23.798772829561347, CurrSamplesPerSec=23.78475688836171, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:06:13,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=7310, skipped=137, lr=[5.017515494174654e-06, 5.017515494174654e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 21:06:13,371] [INFO] [timer.py:199:stop] epoch=7/micro_step=3480/global_step=7310, RunningAvgSamplesPerSec=23.79887009649188, CurrSamplesPerSec=23.85165589840892, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:06:40,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=7320, skipped=137, lr=[5.00722558391117e-06, 5.00722558391117e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:06:40,252] [INFO] [timer.py:199:stop] epoch=7/micro_step=3520/global_step=7320, RunningAvgSamplesPerSec=23.798941206154133, CurrSamplesPerSec=23.859697201415642, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:07:06,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=7330, skipped=137, lr=[4.9969348436186015e-06, 4.9969348436186015e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:07:07,154] [INFO] [timer.py:199:stop] epoch=7/micro_step=3560/global_step=7330, RunningAvgSamplesPerSec=23.798982830990123, CurrSamplesPerSec=23.854481290879505, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:07:28,346] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:07:30,741] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:07:33,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=7340, skipped=139, lr=[4.988701685262707e-06, 4.988701685262707e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:07:33,431] [INFO] [timer.py:199:stop] epoch=7/micro_step=3600/global_step=7340, RunningAvgSamplesPerSec=23.799775768003265, CurrSamplesPerSec=23.834125035560007, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:08:00,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=7350, skipped=139, lr=[4.978409569042277e-06, 4.978409569042277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:08:00,312] [INFO] [timer.py:199:stop] epoch=7/micro_step=3640/global_step=7350, RunningAvgSamplesPerSec=23.79984260158082, CurrSamplesPerSec=23.844431195340043, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:08:26,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=7360, skipped=139, lr=[4.968116754048374e-06, 4.968116754048374e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:08:27,179] [INFO] [timer.py:199:stop] epoch=7/micro_step=3680/global_step=7360, RunningAvgSamplesPerSec=23.799931094012173, CurrSamplesPerSec=23.832303118929236, MemAllocated=11.44GB, MaxMemAllocated=31.66GB ***** Evaluating perplexity, Epoch 8/16 ***** ppl: 1.8051525354385376 saving the final model ... 
Beginning of Epoch 9/16, Total Micro Batches 3680 [2023-04-23 21:09:45,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=7370, skipped=139, lr=[4.957823287164291e-06, 4.957823287164291e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:09:45,758] [INFO] [timer.py:199:stop] epoch=8/micro_step=40/global_step=7370, RunningAvgSamplesPerSec=23.799922097925826, CurrSamplesPerSec=23.86809409366705, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:10:12,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=7380, skipped=139, lr=[4.947529215276293e-06, 4.947529215276293e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:10:12,680] [INFO] [timer.py:199:stop] epoch=8/micro_step=80/global_step=7380, RunningAvgSamplesPerSec=23.799938043791215, CurrSamplesPerSec=23.807581393834916, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:10:39,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=7390, skipped=139, lr=[4.937234585273403e-06, 4.937234585273403e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:10:39,609] [INFO] [timer.py:199:stop] epoch=8/micro_step=120/global_step=7390, RunningAvgSamplesPerSec=23.79995242559557, CurrSamplesPerSec=23.865231527958997, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:11:06,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=7400, skipped=139, lr=[4.92693944404718e-06, 4.92693944404718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:11:06,532] [INFO] [timer.py:199:stop] epoch=8/micro_step=160/global_step=7400, RunningAvgSamplesPerSec=23.79996668696059, CurrSamplesPerSec=23.78797961806017, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:11:33,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=7410, skipped=139, lr=[4.916643838491515e-06, 4.916643838491515e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:11:33,433] [INFO] [timer.py:199:stop] epoch=8/micro_step=200/global_step=7410, RunningAvgSamplesPerSec=23.800006638430645, CurrSamplesPerSec=23.88122103038484, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 21:12:00,051] [INFO] [logging.py:96:log_dist] [Rank 0] step=7420, skipped=139, lr=[4.906347815502415e-06, 4.906347815502415e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:12:00,293] [INFO] [timer.py:199:stop] epoch=8/micro_step=240/global_step=7420, RunningAvgSamplesPerSec=23.80009268209198, CurrSamplesPerSec=23.832193093394498, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:12:26,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=7430, skipped=139, lr=[4.896051421977788e-06, 4.896051421977788e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:12:27,190] [INFO] [timer.py:199:stop] epoch=8/micro_step=280/global_step=7430, RunningAvgSamplesPerSec=23.80014217871916, CurrSamplesPerSec=23.857209818131466, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:12:53,749] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:12:53,750] [INFO] [logging.py:96:log_dist] [Rank 0] step=7440, skipped=140, lr=[4.88678438976023e-06, 4.88678438976023e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:12:53,750] [INFO] [timer.py:199:stop] epoch=8/micro_step=320/global_step=7440, RunningAvgSamplesPerSec=23.800583943922778, CurrSamplesPerSec=26.727976502401265, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:12:56,146] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:13:20,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=7450, skipped=141, lr=[4.877517129588383e-06, 4.877517129588383e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:13:20,353] [INFO] [timer.py:199:stop] epoch=8/micro_step=360/global_step=7450, RunningAvgSamplesPerSec=23.80097479653553, CurrSamplesPerSec=23.780098226118508, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:13:46,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=7460, skipped=141, lr=[4.867219948074693e-06, 4.867219948074693e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:13:47,240] [INFO] [timer.py:199:stop] epoch=8/micro_step=400/global_step=7460, RunningAvgSamplesPerSec=23.801029821725667, CurrSamplesPerSec=23.825542667339683, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:14:13,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=7470, skipped=141, lr=[4.856922574251095e-06, 4.856922574251095e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:14:14,118] [INFO] [timer.py:199:stop] epoch=8/micro_step=440/global_step=7470, RunningAvgSamplesPerSec=23.801097334194687, CurrSamplesPerSec=23.803766528308554, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:14:40,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=7480, skipped=141, lr=[4.846625055021653e-06, 4.846625055021653e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:14:41,183] [INFO] [timer.py:199:stop] epoch=8/micro_step=480/global_step=7480, RunningAvgSamplesPerSec=23.80095102027574, CurrSamplesPerSec=23.86949910283345, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:15:07,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=7490, skipped=141, lr=[4.836327437291088e-06, 4.836327437291088e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:15:08,058] [INFO] [timer.py:199:stop] epoch=8/micro_step=520/global_step=7490, RunningAvgSamplesPerSec=23.801021238582585, CurrSamplesPerSec=23.86211510420829, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 21:15:34,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=7500, skipped=141, lr=[4.8260297679645725e-06, 4.8260297679645725e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:15:34,986] [INFO] [timer.py:199:stop] epoch=8/micro_step=560/global_step=7500, RunningAvgSamplesPerSec=23.80103747035047, CurrSamplesPerSec=23.837352691253727, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:16:01,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=7510, skipped=141, lr=[4.815732093947511e-06, 4.815732093947511e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:16:01,836] [INFO] [timer.py:199:stop] epoch=8/micro_step=600/global_step=7510, RunningAvgSamplesPerSec=23.80113317610337, CurrSamplesPerSec=23.83805971746395, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:16:28,512] [INFO] [logging.py:96:log_dist] [Rank 0] step=7520, skipped=141, lr=[4.805434462145331e-06, 4.805434462145331e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:16:28,755] [INFO] [timer.py:199:stop] epoch=8/micro_step=640/global_step=7520, RunningAvgSamplesPerSec=23.80115909011171, CurrSamplesPerSec=23.824569951866348, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:16:55,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=7530, skipped=141, lr=[4.795136919463269e-06, 4.795136919463269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:16:55,666] [INFO] [timer.py:199:stop] epoch=8/micro_step=680/global_step=7530, RunningAvgSamplesPerSec=23.801186408210615, CurrSamplesPerSec=23.773672647352726, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:17:22,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=7540, skipped=141, lr=[4.784839512806156e-06, 4.784839512806156e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:17:22,528] [INFO] [timer.py:199:stop] epoch=8/micro_step=720/global_step=7540, RunningAvgSamplesPerSec=23.801266737703923, CurrSamplesPerSec=23.82998644959505, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 21:17:27,619] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:17:30,021] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:17:48,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=7550, skipped=143, lr=[4.77660171693808e-06, 4.77660171693808e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:17:48,837] [INFO] [timer.py:199:stop] epoch=8/micro_step=760/global_step=7550, RunningAvgSamplesPerSec=23.801998988720356, CurrSamplesPerSec=23.857934986293277, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:18:15,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=7560, skipped=143, lr=[4.7663046733239265e-06, 4.7663046733239265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:18:15,711] [INFO] [timer.py:199:stop] epoch=8/micro_step=800/global_step=7560, RunningAvgSamplesPerSec=23.802064202512245, CurrSamplesPerSec=23.862322981884365, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:18:42,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=7570, skipped=143, lr=[4.756007897064266e-06, 4.756007897064266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:18:42,587] [INFO] [timer.py:199:stop] epoch=8/micro_step=840/global_step=7570, RunningAvgSamplesPerSec=23.80213046718903, CurrSamplesPerSec=23.8317128015148, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:19:09,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=7580, skipped=143, lr=[4.745711435060432e-06, 4.745711435060432e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:19:09,459] [INFO] [timer.py:199:stop] epoch=8/micro_step=880/global_step=7580, RunningAvgSamplesPerSec=23.802200602415102, CurrSamplesPerSec=23.82872357848792, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:19:36,076] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=7590, skipped=143, lr=[4.735415334212338e-06, 4.735415334212338e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:19:36,316] [INFO] [timer.py:199:stop] epoch=8/micro_step=920/global_step=7590, RunningAvgSamplesPerSec=23.802284778873982, CurrSamplesPerSec=23.839126686712927, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:20:02,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=7600, skipped=143, lr=[4.725119641418242e-06, 4.725119641418242e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:20:03,169] [INFO] [timer.py:199:stop] epoch=8/micro_step=960/global_step=7600, RunningAvgSamplesPerSec=23.802375134128916, CurrSamplesPerSec=23.88705442823867, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:20:29,784] [INFO] [logging.py:96:log_dist] [Rank 0] step=7610, skipped=143, lr=[4.714824403574548e-06, 4.714824403574548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:20:30,026] [INFO] [timer.py:199:stop] epoch=8/micro_step=1000/global_step=7610, RunningAvgSamplesPerSec=23.802459728053904, CurrSamplesPerSec=23.88071326645719, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:20:56,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=7620, skipped=143, lr=[4.704529667575589e-06, 4.704529667575589e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:20:56,877] [INFO] [timer.py:199:stop] epoch=8/micro_step=1040/global_step=7620, RunningAvgSamplesPerSec=23.802551598428643, CurrSamplesPerSec=23.871451958245416, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:21:23,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=7630, skipped=143, lr=[4.694235480313407e-06, 4.694235480313407e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:21:23,746] [INFO] [timer.py:199:stop] epoch=8/micro_step=1080/global_step=7630, RunningAvgSamplesPerSec=23.802622920967714, CurrSamplesPerSec=23.85321158309667, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:21:50,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=7640, skipped=143, 
lr=[4.683941888677548e-06, 4.683941888677548e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:21:50,586] [INFO] [timer.py:199:stop] epoch=8/micro_step=1120/global_step=7640, RunningAvgSamplesPerSec=23.80272419823172, CurrSamplesPerSec=23.859334558258862, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:22:01,053] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:22:03,454] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:22:16,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=7650, skipped=145, lr=[4.675707475727864e-06, 4.675707475727864e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:22:16,904] [INFO] [timer.py:199:stop] epoch=8/micro_step=1160/global_step=7650, RunningAvgSamplesPerSec=23.803431020076, CurrSamplesPerSec=23.811577018547425, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:22:43,551] [INFO] [logging.py:96:log_dist] [Rank 0] step=7660, skipped=145, lr=[4.6654150743722495e-06, 4.6654150743722495e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:22:43,794] [INFO] [timer.py:199:stop] epoch=8/micro_step=1200/global_step=7660, RunningAvgSamplesPerSec=23.803476699633446, CurrSamplesPerSec=23.879973966793784, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:23:10,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=7670, skipped=145, lr=[4.655123399918569e-06, 4.655123399918569e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:23:10,699] [INFO] [timer.py:199:stop] epoch=8/micro_step=1240/global_step=7670, RunningAvgSamplesPerSec=23.8035040422505, CurrSamplesPerSec=23.836252015357967, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:23:37,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=7680, skipped=145, lr=[4.6448324992449274e-06, 4.6448324992449274e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 21:23:37,554] [INFO] [timer.py:199:stop] epoch=8/micro_step=1280/global_step=7680, RunningAvgSamplesPerSec=23.803590703244435, CurrSamplesPerSec=23.884956627035393, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:24:04,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=7690, skipped=145, lr=[4.634542419225899e-06, 4.634542419225899e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:24:04,403] [INFO] [timer.py:199:stop] epoch=8/micro_step=1320/global_step=7690, RunningAvgSamplesPerSec=23.803685603513664, CurrSamplesPerSec=23.95616850099618, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:24:31,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=7700, skipped=145, lr=[4.624253206732319e-06, 4.624253206732319e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:24:31,272] [INFO] [timer.py:199:stop] epoch=8/micro_step=1360/global_step=7700, RunningAvgSamplesPerSec=23.803755224807823, CurrSamplesPerSec=23.89034533898984, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:24:57,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=7710, skipped=145, lr=[4.613964908631074e-06, 4.613964908631074e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:24:58,153] [INFO] [timer.py:199:stop] epoch=8/micro_step=1400/global_step=7710, RunningAvgSamplesPerSec=23.803811466049947, CurrSamplesPerSec=23.869800500968406, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:25:24,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=7720, skipped=145, lr=[4.603677571784887e-06, 4.603677571784887e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:25:25,037] [INFO] [timer.py:199:stop] epoch=8/micro_step=1440/global_step=7720, RunningAvgSamplesPerSec=23.803865853289242, CurrSamplesPerSec=23.813197191407447, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:25:51,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=7730, skipped=145, lr=[4.593391243052097e-06, 4.593391243052097e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 21:25:51,945] [INFO] [timer.py:199:stop] epoch=8/micro_step=1480/global_step=7730, RunningAvgSamplesPerSec=23.803891084629786, CurrSamplesPerSec=23.87046912259335, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:26:18,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=7740, skipped=145, lr=[4.583105969286457e-06, 4.583105969286457e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:26:18,834] [INFO] [timer.py:199:stop] epoch=8/micro_step=1520/global_step=7740, RunningAvgSamplesPerSec=23.803936431002093, CurrSamplesPerSec=23.855295331582578, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:26:34,670] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:26:37,069] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:26:44,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=7750, skipped=147, lr=[4.574878541332951e-06, 4.574878541332951e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:26:45,125] [INFO] [timer.py:199:stop] epoch=8/micro_step=1560/global_step=7750, RunningAvgSamplesPerSec=23.804668513357683, CurrSamplesPerSec=23.85122992205392, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:27:11,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=7760, skipped=147, lr=[4.564595284564153e-06, 4.564595284564153e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:27:11,964] [INFO] [timer.py:199:stop] epoch=8/micro_step=1600/global_step=7760, RunningAvgSamplesPerSec=23.804773751880106, CurrSamplesPerSec=23.851312572992793, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:27:38,585] [INFO] [logging.py:96:log_dist] [Rank 0] step=7770, skipped=147, lr=[4.5543132139267634e-06, 4.5543132139267634e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:27:38,827] [INFO] [timer.py:199:stop] 
epoch=8/micro_step=1640/global_step=7770, RunningAvgSamplesPerSec=23.804851377443395, CurrSamplesPerSec=23.847638338454924, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:28:05,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=7780, skipped=147, lr=[4.544032376255134e-06, 4.544032376255134e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:28:05,705] [INFO] [timer.py:199:stop] epoch=8/micro_step=1680/global_step=7780, RunningAvgSamplesPerSec=23.804910597734565, CurrSamplesPerSec=23.87799422876438, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:28:32,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=7790, skipped=147, lr=[4.533752818378e-06, 4.533752818378e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:28:32,597] [INFO] [timer.py:199:stop] epoch=8/micro_step=1720/global_step=7790, RunningAvgSamplesPerSec=23.804953185920702, CurrSamplesPerSec=23.7966235420177, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:28:59,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=7800, skipped=147, lr=[4.523474587118277e-06, 4.523474587118277e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:28:59,472] [INFO] [timer.py:199:stop] epoch=8/micro_step=1760/global_step=7800, RunningAvgSamplesPerSec=23.805018523726535, CurrSamplesPerSec=23.870596483340275, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:29:26,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=7810, skipped=147, lr=[4.513197729292828e-06, 4.513197729292828e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:29:26,335] [INFO] [timer.py:199:stop] epoch=8/micro_step=1800/global_step=7810, RunningAvgSamplesPerSec=23.805096032364336, CurrSamplesPerSec=23.796452669309698, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:29:52,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=7820, skipped=147, lr=[4.502922291712266e-06, 4.502922291712266e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:29:53,182] [INFO] [timer.py:199:stop] 
epoch=8/micro_step=1840/global_step=7820, RunningAvgSamplesPerSec=23.805191496069703, CurrSamplesPerSec=23.914184717308686, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:30:19,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=7830, skipped=147, lr=[4.492648321180732e-06, 4.492648321180732e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:30:20,029] [INFO] [timer.py:199:stop] epoch=8/micro_step=1880/global_step=7830, RunningAvgSamplesPerSec=23.805284166227686, CurrSamplesPerSec=23.845123814374773, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:30:46,638] [INFO] [logging.py:96:log_dist] [Rank 0] step=7840, skipped=147, lr=[4.482375864495682e-06, 4.482375864495682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:30:46,878] [INFO] [timer.py:199:stop] epoch=8/micro_step=1920/global_step=7840, RunningAvgSamplesPerSec=23.80537296397802, CurrSamplesPerSec=23.936214668429127, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:31:08,061] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:31:10,459] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:31:12,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=7850, skipped=149, lr=[4.474159020560612e-06, 4.474159020560612e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:31:13,145] [INFO] [timer.py:199:stop] epoch=8/micro_step=1960/global_step=7850, RunningAvgSamplesPerSec=23.806119244091185, CurrSamplesPerSec=23.879999459123624, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:31:39,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=7860, skipped=149, lr=[4.463889406706694e-06, 4.463889406706694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:31:40,031] [INFO] [timer.py:199:stop] epoch=8/micro_step=2000/global_step=7860, RunningAvgSamplesPerSec=23.806167783369485, CurrSamplesPerSec=23.816972817190816, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:32:06,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=7870, skipped=149, lr=[4.453621437694782e-06, 4.453621437694782e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:32:06,915] [INFO] [timer.py:199:stop] epoch=8/micro_step=2040/global_step=7870, RunningAvgSamplesPerSec=23.806217622975556, CurrSamplesPerSec=23.888825199702087, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:32:33,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=7880, skipped=149, lr=[4.4433551602950034e-06, 4.4433551602950034e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:32:33,798] [INFO] [timer.py:199:stop] epoch=8/micro_step=2080/global_step=7880, RunningAvgSamplesPerSec=23.806268143194036, CurrSamplesPerSec=23.843569184427587, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:33:00,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=7890, skipped=149, lr=[4.4330906212697755e-06, 4.4330906212697755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:33:00,672] [INFO] [timer.py:199:stop] epoch=8/micro_step=2120/global_step=7890, RunningAvgSamplesPerSec=23.806330888151116, CurrSamplesPerSec=23.87430539972751, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:33:27,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=7900, skipped=149, lr=[4.422827867373595e-06, 4.422827867373595e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:33:27,516] [INFO] [timer.py:199:stop] epoch=8/micro_step=2160/global_step=7900, RunningAvgSamplesPerSec=23.80642522169127, CurrSamplesPerSec=23.880324491100016, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:33:54,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=7910, skipped=149, lr=[4.412566945352832e-06, 4.412566945352832e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:33:54,408] [INFO] [timer.py:199:stop] epoch=8/micro_step=2200/global_step=7910, RunningAvgSamplesPerSec=23.806466393007213, CurrSamplesPerSec=23.872422136737118, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:34:21,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=7920, skipped=149, lr=[4.402307901945511e-06, 4.402307901945511e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:34:21,267] [INFO] [timer.py:199:stop] epoch=8/micro_step=2240/global_step=7920, RunningAvgSamplesPerSec=23.806541698739245, CurrSamplesPerSec=23.88073238688331, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:34:47,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=7930, skipped=149, lr=[4.392050783881096e-06, 4.392050783881096e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:34:48,133] [INFO] [timer.py:199:stop] epoch=8/micro_step=2280/global_step=7930, RunningAvgSamplesPerSec=23.80660989510517, CurrSamplesPerSec=23.874473145565776, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:35:14,746] [INFO] [logging.py:96:log_dist] [Rank 0] step=7940, skipped=149, lr=[4.381795637880289e-06, 4.381795637880289e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:35:14,988] [INFO] [timer.py:199:stop] epoch=8/micro_step=2320/global_step=7940, RunningAvgSamplesPerSec=23.806690138832828, CurrSamplesPerSec=23.83011126322057, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 21:35:41,570] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:35:41,571] [INFO] [logging.py:96:log_dist] [Rank 0] step=7950, skipped=150, lr=[4.3725677312013645e-06, 4.3725677312013645e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:35:41,572] [INFO] [timer.py:199:stop] epoch=8/micro_step=2360/global_step=7950, RunningAvgSamplesPerSec=23.807073134862552, CurrSamplesPerSec=26.692715740437087, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:35:43,969] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:36:07,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=7960, skipped=151, lr=[4.363341493777045e-06, 4.363341493777045e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:36:08,153] [INFO] [timer.py:199:stop] epoch=8/micro_step=2400/global_step=7960, RunningAvgSamplesPerSec=23.807454668543805, CurrSamplesPerSec=23.78971253876617, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:36:34,763] [INFO] [logging.py:96:log_dist] [Rank 0] step=7970, skipped=151, lr=[4.353092118031165e-06, 4.353092118031165e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:36:35,004] [INFO] [timer.py:199:stop] epoch=8/micro_step=2440/global_step=7970, RunningAvgSamplesPerSec=23.807537573971878, CurrSamplesPerSec=23.843946174160575, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:37:01,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=7980, skipped=151, lr=[4.342844891803808e-06, 4.342844891803808e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:37:01,865] [INFO] [timer.py:199:stop] epoch=8/micro_step=2480/global_step=7980, RunningAvgSamplesPerSec=23.80760965665842, CurrSamplesPerSec=23.835864686289884, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:37:28,720] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=7990, skipped=151, lr=[4.33259986177061e-06, 4.33259986177061e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:37:28,961] [INFO] [timer.py:199:stop] epoch=8/micro_step=2520/global_step=7990, RunningAvgSamplesPerSec=23.807580587352668, CurrSamplesPerSec=23.851509666004794, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:37:55,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=8000, skipped=151, lr=[4.322357074597214e-06, 4.322357074597214e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:37:55,829] [INFO] [timer.py:199:stop] epoch=8/micro_step=2560/global_step=8000, RunningAvgSamplesPerSec=23.80764874336715, CurrSamplesPerSec=23.886801482316624, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:38:22,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=8010, skipped=151, lr=[4.312116576939037e-06, 4.312116576939037e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:38:22,689] [INFO] [timer.py:199:stop] epoch=8/micro_step=2600/global_step=8010, RunningAvgSamplesPerSec=23.807723290089275, CurrSamplesPerSec=23.88008018519385, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:38:49,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=8020, skipped=151, lr=[4.301878415441072e-06, 4.301878415441072e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:38:49,534] [INFO] [timer.py:199:stop] epoch=8/micro_step=2640/global_step=8020, RunningAvgSamplesPerSec=23.807812287512803, CurrSamplesPerSec=23.850187297863265, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:39:16,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=8030, skipped=151, lr=[4.291642636737669e-06, 4.291642636737669e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:39:16,470] [INFO] [timer.py:199:stop] epoch=8/micro_step=2680/global_step=8030, RunningAvgSamplesPerSec=23.8078027110257, CurrSamplesPerSec=23.794967656769423, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:39:43,117] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=8040, skipped=151, lr=[4.281409287452326e-06, 4.281409287452326e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:39:43,359] [INFO] [timer.py:199:stop] epoch=8/micro_step=2720/global_step=8040, RunningAvgSamplesPerSec=23.807841891916556, CurrSamplesPerSec=23.91064656394879, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:40:10,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=8050, skipped=151, lr=[4.271178414197473e-06, 4.271178414197473e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:40:10,271] [INFO] [timer.py:199:stop] epoch=8/micro_step=2760/global_step=8050, RunningAvgSamplesPerSec=23.80785504665051, CurrSamplesPerSec=23.866186347535038, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:40:15,382] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:40:17,780] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:40:36,442] [INFO] [logging.py:96:log_dist] [Rank 0] step=8060, skipped=153, lr=[4.262995529651905e-06, 4.262995529651905e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:40:36,684] [INFO] [timer.py:199:stop] epoch=8/micro_step=2800/global_step=8060, RunningAvgSamplesPerSec=23.808457609188515, CurrSamplesPerSec=23.851215087330655, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:41:03,341] [INFO] [logging.py:96:log_dist] [Rank 0] step=8070, skipped=153, lr=[4.252769230679131e-06, 4.252769230679131e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:41:03,583] [INFO] [timer.py:199:stop] epoch=8/micro_step=2840/global_step=8070, RunningAvgSamplesPerSec=23.808485708612764, CurrSamplesPerSec=23.851272307079224, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:41:30,279] [INFO] [logging.py:96:log_dist] [Rank 0] step=8080, skipped=153, lr=[4.2425455381909766e-06, 4.2425455381909766e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:41:30,522] [INFO] [timer.py:199:stop] epoch=8/micro_step=2880/global_step=8080, RunningAvgSamplesPerSec=23.808469533014712, CurrSamplesPerSec=23.850897204837928, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:41:57,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=8090, skipped=153, lr=[4.232324498755892e-06, 4.232324498755892e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:41:57,380] [INFO] [timer.py:199:stop] epoch=8/micro_step=2920/global_step=8090, RunningAvgSamplesPerSec=23.808543115340072, CurrSamplesPerSec=23.87841478942536, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:42:23,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=8100, skipped=153, lr=[4.222106158930236e-06, 4.222106158930236e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:42:24,221] [INFO] [timer.py:199:stop] epoch=8/micro_step=2960/global_step=8100, RunningAvgSamplesPerSec=23.808633055211935, CurrSamplesPerSec=23.889977511322435, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:42:50,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=8110, skipped=153, lr=[4.211890565258069e-06, 4.211890565258069e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:42:51,117] [INFO] [timer.py:199:stop] epoch=8/micro_step=3000/global_step=8110, RunningAvgSamplesPerSec=23.808666459248585, CurrSamplesPerSec=23.795404281844686, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:43:17,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=8120, skipped=153, lr=[4.2016777642709525e-06, 4.2016777642709525e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:43:18,004] [INFO] [timer.py:199:stop] epoch=8/micro_step=3040/global_step=8120, RunningAvgSamplesPerSec=23.80870798569425, CurrSamplesPerSec=23.848218851534774, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:43:44,666] [INFO] [logging.py:96:log_dist] [Rank 0] step=8130, skipped=153, lr=[4.191467802487718e-06, 4.191467802487718e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:43:44,908] [INFO] [timer.py:199:stop] epoch=8/micro_step=3080/global_step=8130, RunningAvgSamplesPerSec=23.808729925429635, CurrSamplesPerSec=23.83742889565496, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:44:11,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=8140, skipped=153, lr=[4.181260726414269e-06, 4.181260726414269e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:44:11,801] [INFO] [timer.py:199:stop] epoch=8/micro_step=3120/global_step=8140, RunningAvgSamplesPerSec=23.808764308708017, CurrSamplesPerSec=23.858747142150285, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:44:38,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=8150, skipped=153, lr=[4.171056582543364e-06, 4.171056582543364e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:44:38,698] [INFO] [timer.py:199:stop] epoch=8/micro_step=3160/global_step=8150, RunningAvgSamplesPerSec=23.808795173173404, CurrSamplesPerSec=23.803612439344942, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:44:49,164] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:44:51,565] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:45:04,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=8160, skipped=155, lr=[4.162895409867083e-06, 4.162895409867083e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:45:05,011] [INFO] [timer.py:199:stop] epoch=8/micro_step=3200/global_step=8160, RunningAvgSamplesPerSec=23.809457441215915, CurrSamplesPerSec=23.822664928421638, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:45:31,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=8170, skipped=155, lr=[4.152696661079738e-06, 4.152696661079738e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:45:31,893] [INFO] [timer.py:199:stop] epoch=8/micro_step=3240/global_step=8170, RunningAvgSamplesPerSec=23.809500977067053, CurrSamplesPerSec=23.837075396017035, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:45:58,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=8180, skipped=155, lr=[4.142500974602931e-06, 4.142500974602931e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:45:58,793] [INFO] [timer.py:199:stop] epoch=8/micro_step=3280/global_step=8180, RunningAvgSamplesPerSec=23.809525929128068, CurrSamplesPerSec=23.836448859544607, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:46:25,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=8190, skipped=155, lr=[4.1323083968775395e-06, 4.1323083968775395e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:46:25,769] [INFO] [timer.py:199:stop] epoch=8/micro_step=3320/global_step=8190, RunningAvgSamplesPerSec=23.80948589973411, CurrSamplesPerSec=23.18032820010996, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
21:46:52,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=8200, skipped=155, lr=[4.122118974330281e-06, 4.122118974330281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:46:52,668] [INFO] [timer.py:199:stop] epoch=8/micro_step=3360/global_step=8200, RunningAvgSamplesPerSec=23.809515551520498, CurrSamplesPerSec=23.88272957778082, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:47:19,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=8210, skipped=155, lr=[4.111932753373508e-06, 4.111932753373508e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:47:19,573] [INFO] [timer.py:199:stop] epoch=8/micro_step=3400/global_step=8210, RunningAvgSamplesPerSec=23.80953728400712, CurrSamplesPerSec=23.86098880957215, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:47:46,198] [INFO] [logging.py:96:log_dist] [Rank 0] step=8220, skipped=155, lr=[4.101749780404979e-06, 4.101749780404979e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:47:46,439] [INFO] [timer.py:199:stop] epoch=8/micro_step=3440/global_step=8220, RunningAvgSamplesPerSec=23.809601686385395, CurrSamplesPerSec=23.82836399143393, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:48:13,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=8230, skipped=155, lr=[4.091570101807668e-06, 4.091570101807668e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:48:13,329] [INFO] [timer.py:199:stop] epoch=8/micro_step=3480/global_step=8230, RunningAvgSamplesPerSec=23.809638687501117, CurrSamplesPerSec=23.886004418861045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:48:39,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=8240, skipped=155, lr=[4.081393763949535e-06, 4.081393763949535e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:48:40,220] [INFO] [timer.py:199:stop] epoch=8/micro_step=3520/global_step=8240, RunningAvgSamplesPerSec=23.809674033841304, CurrSamplesPerSec=23.819411658381405, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:49:06,877] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=8250, skipped=155, lr=[4.071220813183329e-06, 4.071220813183329e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:49:07,118] [INFO] [timer.py:199:stop] epoch=8/micro_step=3560/global_step=8250, RunningAvgSamplesPerSec=23.809698457649805, CurrSamplesPerSec=23.810601218987884, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:49:22,950] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:49:25,348] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:49:33,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=8260, skipped=157, lr=[4.063084922415755e-06, 4.063084922415755e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:49:33,405] [INFO] [timer.py:199:stop] epoch=8/micro_step=3600/global_step=8260, RunningAvgSamplesPerSec=23.810381880502582, CurrSamplesPerSec=23.85822336950938, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:50:00,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=8270, skipped=157, lr=[4.052918185174562e-06, 4.052918185174562e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:50:00,273] [INFO] [timer.py:199:stop] epoch=8/micro_step=3640/global_step=8270, RunningAvgSamplesPerSec=23.81043829054771, CurrSamplesPerSec=23.787848921384807, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:50:26,927] [INFO] [logging.py:96:log_dist] [Rank 0] step=8280, skipped=157, lr=[4.042754964730232e-06, 4.042754964730232e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:50:27,170] [INFO] [timer.py:199:stop] epoch=8/micro_step=3680/global_step=8280, RunningAvgSamplesPerSec=23.810464480006125, CurrSamplesPerSec=23.80330215666321, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 9/16 *****
ppl: 1.7905216217041016
saving the final model ...
Beginning of Epoch 10/16, Total Micro Batches 3680
[2023-04-23 21:51:45,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=8290, skipped=157, lr=[4.032595307375774e-06, 4.032595307375774e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:51:45,381] [INFO] [timer.py:199:stop] epoch=9/micro_step=40/global_step=8290, RunningAvgSamplesPerSec=23.8104612878938, CurrSamplesPerSec=23.916573189926503, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:52:12,016] [INFO] [logging.py:96:log_dist] [Rank 0] step=8300, skipped=157, lr=[4.022439259387947e-06, 4.022439259387947e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:52:12,258] [INFO] [timer.py:199:stop] epoch=9/micro_step=80/global_step=8300, RunningAvgSamplesPerSec=23.81050964341161, CurrSamplesPerSec=23.840136584873708, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:52:38,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=8310, skipped=157, lr=[4.012286867027082e-06, 4.012286867027082e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:52:39,148] [INFO] [timer.py:199:stop] epoch=9/micro_step=120/global_step=8310, RunningAvgSamplesPerSec=23.81054235089694, CurrSamplesPerSec=23.841686532114526, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:53:05,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=8320, skipped=157, lr=[4.002138176536857e-06, 4.002138176536857e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:53:06,028] [INFO] [timer.py:199:stop] epoch=9/micro_step=160/global_step=8320, RunningAvgSamplesPerSec=23.810586821721394, CurrSamplesPerSec=23.823165998289113, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:53:32,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=8330, skipped=157, lr=[3.991993234144084e-06, 3.991993234144084e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:53:32,937] [INFO] [timer.py:199:stop] epoch=9/micro_step=200/global_step=8330, RunningAvgSamplesPerSec=23.810598652238266, CurrSamplesPerSec=23.810164036038362, MemAllocated=11.44GB,
MaxMemAllocated=31.66GB [2023-04-23 21:53:59,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=8340, skipped=157, lr=[3.981852086058506e-06, 3.981852086058506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:53:59,835] [INFO] [timer.py:199:stop] epoch=9/micro_step=240/global_step=8340, RunningAvgSamplesPerSec=23.81062527613488, CurrSamplesPerSec=23.767463081405825, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:54:26,524] [INFO] [logging.py:96:log_dist] [Rank 0] step=8350, skipped=157, lr=[3.971714778472583e-06, 3.971714778472583e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:54:26,767] [INFO] [timer.py:199:stop] epoch=9/micro_step=280/global_step=8350, RunningAvgSamplesPerSec=23.810617577393373, CurrSamplesPerSec=23.82051712090936, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:54:48,000] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:54:50,409] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:54:52,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=8360, skipped=159, lr=[3.963607728593767e-06, 3.963607728593767e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:54:53,099] [INFO] [timer.py:199:stop] epoch=9/micro_step=320/global_step=8360, RunningAvgSamplesPerSec=23.811241923092176, CurrSamplesPerSec=23.841186800334942, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:55:19,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=8370, skipped=159, lr=[3.953477450256246e-06, 3.953477450256246e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:55:19,990] [INFO] [timer.py:199:stop] epoch=9/micro_step=360/global_step=8370, RunningAvgSamplesPerSec=23.811274654226036, CurrSamplesPerSec=23.82497805977398, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:55:46,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=8380, skipped=159, lr=[3.943351141663531e-06, 3.943351141663531e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:55:47,238] [INFO] [timer.py:199:stop] epoch=9/micro_step=400/global_step=8380, RunningAvgSamplesPerSec=23.810987735244783, CurrSamplesPerSec=23.8374564140307, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:56:13,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=8390, skipped=159, lr=[3.933228848940487e-06, 3.933228848940487e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:56:14,115] [INFO] [timer.py:199:stop] epoch=9/micro_step=440/global_step=8390, RunningAvgSamplesPerSec=23.811034141395616, CurrSamplesPerSec=23.81390912325734, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:56:40,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=8400, skipped=159, lr=[3.923110618193687e-06, 3.923110618193687e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:56:41,045] [INFO] [timer.py:199:stop] epoch=9/micro_step=480/global_step=8400, RunningAvgSamplesPerSec=23.81105502466084, CurrSamplesPerSec=23.893743496624253, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 21:57:07,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=8410, skipped=159, lr=[3.912996495511206e-06, 3.912996495511206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:57:07,909] [INFO] [timer.py:199:stop] epoch=9/micro_step=520/global_step=8410, RunningAvgSamplesPerSec=23.81111814493284, CurrSamplesPerSec=23.89614915993074, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:57:34,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=8420, skipped=159, lr=[3.902886526962401e-06, 3.902886526962401e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:57:34,776] [INFO] [timer.py:199:stop] epoch=9/micro_step=560/global_step=8420, RunningAvgSamplesPerSec=23.811176069831124, CurrSamplesPerSec=23.885332801352778, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:58:01,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=8430, skipped=159, lr=[3.8927807585977114e-06, 3.8927807585977114e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:58:01,636] [INFO] [timer.py:199:stop] epoch=9/micro_step=600/global_step=8430, RunningAvgSamplesPerSec=23.8112386895755, CurrSamplesPerSec=23.869494857843655, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:58:28,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=8440, skipped=159, lr=[3.882679236448442e-06, 3.882679236448442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:58:28,483] [INFO] [timer.py:199:stop] epoch=9/micro_step=640/global_step=8440, RunningAvgSamplesPerSec=23.811316191648196, CurrSamplesPerSec=23.858250935917148, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:58:55,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=8450, skipped=159, lr=[3.87258200652656e-06, 3.87258200652656e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:58:55,378] [INFO] [timer.py:199:stop] epoch=9/micro_step=680/global_step=8450, RunningAvgSamplesPerSec=23.811342837670473, CurrSamplesPerSec=23.890526067716067, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 21:59:21,967] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 21:59:21,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=8460, skipped=160, lr=[3.863498207464418e-06, 3.863498207464418e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:59:21,969] [INFO] [timer.py:199:stop] epoch=9/micro_step=720/global_step=8460, RunningAvgSamplesPerSec=23.811687525335703, CurrSamplesPerSec=26.691675305844804, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 21:59:24,370] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 21:59:48,347] [INFO] [logging.py:96:log_dist] [Rank 0] step=8470, skipped=161, lr=[3.854417955875361e-06, 3.854417955875361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 21:59:48,590] [INFO] [timer.py:199:stop] epoch=9/micro_step=760/global_step=8470, RunningAvgSamplesPerSec=23.81199939481765, CurrSamplesPerSec=23.855229612693428, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:00:15,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=8480, skipped=161, lr=[3.844332988806596e-06, 3.844332988806596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:00:15,461] [INFO] [timer.py:199:stop] epoch=9/micro_step=800/global_step=8480, RunningAvgSamplesPerSec=23.812049667713733, CurrSamplesPerSec=23.881233777874254, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:00:42,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=8490, skipped=161, lr=[3.834252488630594e-06, 3.834252488630594e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:00:42,360] [INFO] [timer.py:199:stop] epoch=9/micro_step=840/global_step=8490, RunningAvgSamplesPerSec=23.812071456166567, CurrSamplesPerSec=23.832493550140132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:01:08,997] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=8500, skipped=161, lr=[3.824176501263572e-06, 3.824176501263572e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:01:09,239] [INFO] [timer.py:199:stop] epoch=9/micro_step=880/global_step=8500, RunningAvgSamplesPerSec=23.812114657413446, CurrSamplesPerSec=23.842669116701376, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:01:35,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=8510, skipped=161, lr=[3.814105072601182e-06, 3.814105072601182e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:01:36,109] [INFO] [timer.py:199:stop] epoch=9/micro_step=920/global_step=8510, RunningAvgSamplesPerSec=23.812169199194226, CurrSamplesPerSec=23.839797825429763, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:02:02,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=8520, skipped=161, lr=[3.8040382485183128e-06, 3.8040382485183128e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:02:02,958] [INFO] [timer.py:199:stop] epoch=9/micro_step=960/global_step=8520, RunningAvgSamplesPerSec=23.812241409446578, CurrSamplesPerSec=23.881877543786853, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:02:29,603] [INFO] [logging.py:96:log_dist] [Rank 0] step=8530, skipped=161, lr=[3.7939760748688867e-06, 3.7939760748688867e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:02:29,846] [INFO] [timer.py:199:stop] epoch=9/micro_step=1000/global_step=8530, RunningAvgSamplesPerSec=23.812274304491943, CurrSamplesPerSec=23.86083185793196, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:02:56,501] [INFO] [logging.py:96:log_dist] [Rank 0] step=8540, skipped=161, lr=[3.783918597485634e-06, 3.783918597485634e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:02:56,741] [INFO] [timer.py:199:stop] epoch=9/micro_step=1040/global_step=8540, RunningAvgSamplesPerSec=23.812300279871163, CurrSamplesPerSec=23.788723773272363, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:03:23,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=8550, 
skipped=161, lr=[3.7738658621798984e-06, 3.7738658621798984e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:03:23,614] [INFO] [timer.py:199:stop] epoch=9/micro_step=1080/global_step=8550, RunningAvgSamplesPerSec=23.81234854635888, CurrSamplesPerSec=23.870683513965663, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:03:50,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=8560, skipped=161, lr=[3.763817914741425e-06, 3.763817914741425e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:03:50,492] [INFO] [timer.py:199:stop] epoch=9/micro_step=1120/global_step=8560, RunningAvgSamplesPerSec=23.812390416749626, CurrSamplesPerSec=23.814483769920372, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:03:55,574] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:03:57,981] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:04:16,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=8570, skipped=163, lr=[3.755783034811864e-06, 3.755783034811864e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:04:16,783] [INFO] [timer.py:199:stop] epoch=9/micro_step=1160/global_step=8570, RunningAvgSamplesPerSec=23.81303695481547, CurrSamplesPerSec=23.855246572372092, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:04:43,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=8580, skipped=163, lr=[3.7457438208548643e-06, 3.7457438208548643e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:04:44,189] [INFO] [timer.py:199:stop] epoch=9/micro_step=1200/global_step=8580, RunningAvgSamplesPerSec=23.812611396727938, CurrSamplesPerSec=23.900587401335176, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:05:10,779] [INFO] [logging.py:96:log_dist] [Rank 0] step=8590, skipped=163, lr=[3.7357095228597225e-06, 3.7357095228597225e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:05:11,021] [INFO] [timer.py:199:stop] epoch=9/micro_step=1240/global_step=8590, RunningAvgSamplesPerSec=23.812698229881494, CurrSamplesPerSec=23.845613119964014, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:05:37,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=8600, skipped=163, lr=[3.7256801865321977e-06, 3.7256801865321977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:05:37,920] [INFO] [timer.py:199:stop] epoch=9/micro_step=1280/global_step=8600, RunningAvgSamplesPerSec=23.812725175493217, CurrSamplesPerSec=23.8344043784162, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:06:04,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=8610, skipped=163, lr=[3.7156558575554496e-06, 3.7156558575554496e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:06:04,775] [INFO] [timer.py:199:stop] epoch=9/micro_step=1320/global_step=8610, RunningAvgSamplesPerSec=23.812791518434693, CurrSamplesPerSec=23.836901825182437, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:06:31,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=8620, skipped=163, lr=[3.7056365815898375e-06, 3.7056365815898375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:06:31,636] [INFO] [timer.py:199:stop] epoch=9/micro_step=1360/global_step=8620, RunningAvgSamplesPerSec=23.812851644523192, CurrSamplesPerSec=23.90264325288746, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:06:58,260] [INFO] [logging.py:96:log_dist] [Rank 0] step=8630, skipped=163, lr=[3.695622404272694e-06, 3.695622404272694e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:06:58,501] [INFO] [timer.py:199:stop] epoch=9/micro_step=1400/global_step=8630, RunningAvgSamplesPerSec=23.812906673615267, CurrSamplesPerSec=23.878138662068626, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:07:25,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=8640, skipped=163, lr=[3.685613371218135e-06, 3.685613371218135e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:07:25,374] [INFO] [timer.py:199:stop] epoch=9/micro_step=1440/global_step=8640, RunningAvgSamplesPerSec=23.812953459802053, CurrSamplesPerSec=23.648365302422555, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:07:51,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=8650, skipped=163, lr=[3.6756095280168414e-06, 3.6756095280168414e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:07:52,214] [INFO] [timer.py:199:stop] epoch=9/micro_step=1480/global_step=8650, RunningAvgSamplesPerSec=23.813038504280097, CurrSamplesPerSec=23.8594363517163, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:08:18,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=8660, skipped=163, lr=[3.665610920235854e-06, 3.665610920235854e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:08:19,119] [INFO] [timer.py:199:stop] epoch=9/micro_step=1520/global_step=8660, RunningAvgSamplesPerSec=23.813054898762832, CurrSamplesPerSec=23.684422578755303, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:08:29,608] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:08:32,011] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:08:45,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=8670, skipped=165, lr=[3.6576158341195206e-06, 3.6576158341195206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:08:45,441] [INFO] [timer.py:199:stop] epoch=9/micro_step=1560/global_step=8670, RunningAvgSamplesPerSec=23.813665819511243, CurrSamplesPerSec=23.860524323542595, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:09:12,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=8680, skipped=165, lr=[3.6476267648477806e-06, 3.6476267648477806e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:09:12,340] [INFO] [timer.py:199:stop] epoch=9/micro_step=1600/global_step=8680, RunningAvgSamplesPerSec=23.81368719158841, CurrSamplesPerSec=23.745937871174778, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:09:39,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=8690, skipped=165, lr=[3.637643058456535e-06, 3.637643058456535e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:09:39,267] [INFO] [timer.py:199:stop] epoch=9/micro_step=1640/global_step=8690, RunningAvgSamplesPerSec=23.813689720343014, CurrSamplesPerSec=23.760941269277435, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:10:05,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=8700, skipped=165, lr=[3.627664760421107e-06, 3.627664760421107e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:10:06,118] [INFO] [timer.py:199:stop] epoch=9/micro_step=1680/global_step=8700, RunningAvgSamplesPerSec=23.813763706724373, CurrSamplesPerSec=23.85835908167049, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
22:10:32,710] [INFO] [logging.py:96:log_dist] [Rank 0] step=8710, skipped=165, lr=[3.6176919161921795e-06, 3.6176919161921795e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:10:32,951] [INFO] [timer.py:199:stop] epoch=9/micro_step=1720/global_step=8710, RunningAvgSamplesPerSec=23.813849094706956, CurrSamplesPerSec=23.8777457226098, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:10:59,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=8720, skipped=165, lr=[3.6077245711955935e-06, 3.6077245711955935e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:10:59,837] [INFO] [timer.py:199:stop] epoch=9/micro_step=1760/global_step=8720, RunningAvgSamplesPerSec=23.813881998723087, CurrSamplesPerSec=23.840221276238985, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:11:26,455] [INFO] [logging.py:96:log_dist] [Rank 0] step=8730, skipped=165, lr=[3.597762770832147e-06, 3.597762770832147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:11:26,696] [INFO] [timer.py:199:stop] epoch=9/micro_step=1800/global_step=8730, RunningAvgSamplesPerSec=23.81394239538355, CurrSamplesPerSec=23.889966880633402, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:11:53,330] [INFO] [logging.py:96:log_dist] [Rank 0] step=8740, skipped=165, lr=[3.5878065604773783e-06, 3.5878065604773783e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:11:53,572] [INFO] [timer.py:199:stop] epoch=9/micro_step=1840/global_step=8740, RunningAvgSamplesPerSec=23.81399136125027, CurrSamplesPerSec=23.900615065726914, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:12:20,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=8750, skipped=165, lr=[3.577855985481361e-06, 3.577855985481361e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:12:20,431] [INFO] [timer.py:199:stop] epoch=9/micro_step=1880/global_step=8750, RunningAvgSamplesPerSec=23.814054091980246, CurrSamplesPerSec=23.86383975005663, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:12:47,058] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=8760, skipped=165, lr=[3.567911091168506e-06, 3.567911091168506e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:12:47,299] [INFO] [timer.py:199:stop] epoch=9/micro_step=1920/global_step=8760, RunningAvgSamplesPerSec=23.814107309847778, CurrSamplesPerSec=23.818221762720214, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:13:03,139] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:13:05,539] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:13:13,366] [INFO] [logging.py:96:log_dist] [Rank 0] step=8770, skipped=167, lr=[3.5599592962515865e-06, 3.5599592962515865e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:13:13,607] [INFO] [timer.py:199:stop] epoch=9/micro_step=1960/global_step=8770, RunningAvgSamplesPerSec=23.814723331403638, CurrSamplesPerSec=23.826564104075313, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:13:40,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=8780, skipped=167, lr=[3.550024741303201e-06, 3.550024741303201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:13:40,481] [INFO] [timer.py:199:stop] epoch=9/micro_step=2000/global_step=8780, RunningAvgSamplesPerSec=23.81477127550438, CurrSamplesPerSec=23.784478708038986, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:14:07,102] [INFO] [logging.py:96:log_dist] [Rank 0] step=8790, skipped=167, lr=[3.5400959938080063e-06, 3.5400959938080063e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:14:07,345] [INFO] [timer.py:199:stop] epoch=9/micro_step=2040/global_step=8790, RunningAvgSamplesPerSec=23.814828319850445, CurrSamplesPerSec=23.86484325651629, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:14:33,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=8800, skipped=167, 
lr=[3.530173098990983e-06, 3.530173098990983e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:14:34,177] [INFO] [timer.py:199:stop] epoch=9/micro_step=2080/global_step=8800, RunningAvgSamplesPerSec=23.814914380240104, CurrSamplesPerSec=23.900306504983988, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:15:00,796] [INFO] [logging.py:96:log_dist] [Rank 0] step=8810, skipped=167, lr=[3.5202561020504554e-06, 3.5202561020504554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:15:01,037] [INFO] [timer.py:199:stop] epoch=9/micro_step=2120/global_step=8810, RunningAvgSamplesPerSec=23.814969480537496, CurrSamplesPerSec=23.86245237696059, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:15:27,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=8820, skipped=167, lr=[3.51034504815789e-06, 3.51034504815789e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:15:27,889] [INFO] [timer.py:199:stop] epoch=9/micro_step=2160/global_step=8820, RunningAvgSamplesPerSec=23.81503659680237, CurrSamplesPerSec=23.899321291945604, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:15:54,503] [INFO] [logging.py:96:log_dist] [Rank 0] step=8830, skipped=167, lr=[3.500439982457675e-06, 3.500439982457675e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:15:54,744] [INFO] [timer.py:199:stop] epoch=9/micro_step=2200/global_step=8830, RunningAvgSamplesPerSec=23.815102712476662, CurrSamplesPerSec=23.91960580991433, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:16:21,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=8840, skipped=167, lr=[3.4905409500669228e-06, 3.4905409500669228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:16:21,604] [INFO] [timer.py:199:stop] epoch=9/micro_step=2240/global_step=8840, RunningAvgSamplesPerSec=23.815162770922807, CurrSamplesPerSec=23.875685657099893, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:16:48,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=8850, skipped=167, 
lr=[3.4806479960752714e-06, 3.4806479960752714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:16:48,473] [INFO] [timer.py:199:stop] epoch=9/micro_step=2280/global_step=8850, RunningAvgSamplesPerSec=23.815214744687744, CurrSamplesPerSec=23.898699989850627, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:17:15,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=8860, skipped=167, lr=[3.470761165544666e-06, 3.470761165544666e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:17:15,368] [INFO] [timer.py:199:stop] epoch=9/micro_step=2320/global_step=8860, RunningAvgSamplesPerSec=23.815239189977916, CurrSamplesPerSec=23.700614253965863, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:17:36,559] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:17:38,964] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:17:41,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=8870, skipped=169, lr=[3.46285614027596e-06, 3.46285614027596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:17:41,652] [INFO] [timer.py:199:stop] epoch=9/micro_step=2360/global_step=8870, RunningAvgSamplesPerSec=23.815871501857647, CurrSamplesPerSec=23.857733546270737, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:18:08,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=8880, skipped=169, lr=[3.45298044544218e-06, 3.45298044544218e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:18:08,527] [INFO] [timer.py:199:stop] epoch=9/micro_step=2400/global_step=8880, RunningAvgSamplesPerSec=23.815910914319048, CurrSamplesPerSec=23.82974317255737, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:18:35,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=8890, skipped=169, lr=[3.4431110000938553e-06, 3.4431110000938553e-06], mom=[(0.9, 0.95), 
(0.9, 0.95)] [2023-04-23 22:18:35,414] [INFO] [timer.py:199:stop] epoch=9/micro_step=2440/global_step=8890, RunningAvgSamplesPerSec=23.81594636410551, CurrSamplesPerSec=23.862528741926347, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:19:02,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=8900, skipped=169, lr=[3.433247849185851e-06, 3.433247849185851e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:19:02,282] [INFO] [timer.py:199:stop] epoch=9/micro_step=2480/global_step=8900, RunningAvgSamplesPerSec=23.815995442389912, CurrSamplesPerSec=23.89313524469783, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:19:28,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=8910, skipped=169, lr=[3.4233910376443656e-06, 3.4233910376443656e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:19:29,191] [INFO] [timer.py:199:stop] epoch=9/micro_step=2520/global_step=8910, RunningAvgSamplesPerSec=23.816003238104866, CurrSamplesPerSec=23.821882709585356, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:19:55,825] [INFO] [logging.py:96:log_dist] [Rank 0] step=8920, skipped=169, lr=[3.413540610366716e-06, 3.413540610366716e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:19:56,066] [INFO] [timer.py:199:stop] epoch=9/micro_step=2560/global_step=8920, RunningAvgSamplesPerSec=23.816045584141527, CurrSamplesPerSec=23.90357765428238, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:20:22,743] [INFO] [logging.py:96:log_dist] [Rank 0] step=8930, skipped=169, lr=[3.4036966122211405e-06, 3.4036966122211405e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:20:22,986] [INFO] [timer.py:199:stop] epoch=9/micro_step=2600/global_step=8930, RunningAvgSamplesPerSec=23.816042555475956, CurrSamplesPerSec=23.814016867394205, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:20:49,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=8940, skipped=169, lr=[3.3938590880465983e-06, 3.3938590880465983e-06], mom=[(0.9, 0.95), (0.9, 0.95)] 
[2023-04-23 22:20:49,954] [INFO] [timer.py:199:stop] epoch=9/micro_step=2640/global_step=8940, RunningAvgSamplesPerSec=23.816039403801206, CurrSamplesPerSec=23.826629665117636, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:21:16,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=8950, skipped=169, lr=[3.3840280826525543e-06, 3.3840280826525543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:21:16,852] [INFO] [timer.py:199:stop] epoch=9/micro_step=2680/global_step=8950, RunningAvgSamplesPerSec=23.816057937729205, CurrSamplesPerSec=23.63690409782836, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:21:43,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=8960, skipped=169, lr=[3.3742036408187796e-06, 3.3742036408187796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:21:43,766] [INFO] [timer.py:199:stop] epoch=9/micro_step=2720/global_step=8960, RunningAvgSamplesPerSec=23.816068799147583, CurrSamplesPerSec=23.86693115984455, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:22:10,363] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:22:10,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=8970, skipped=170, lr=[3.36536729199881e-06, 3.36536729199881e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:22:10,364] [INFO] [timer.py:199:stop] epoch=9/micro_step=2760/global_step=8970, RunningAvgSamplesPerSec=23.816383486350027, CurrSamplesPerSec=26.678011208253906, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:22:12,765] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:22:36,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=8980, skipped=171, lr=[3.3565363285122686e-06, 3.3565363285122686e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:22:36,973] [INFO] [timer.py:199:stop] epoch=9/micro_step=2800/global_step=8980, RunningAvgSamplesPerSec=23.816685055625964, CurrSamplesPerSec=23.814741524859475, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:23:03,638] [INFO] [logging.py:96:log_dist] [Rank 0] step=8990, skipped=171, lr=[3.346730502620388e-06, 3.346730502620388e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:23:03,880] [INFO] [timer.py:199:stop] epoch=9/micro_step=2840/global_step=8990, RunningAvgSamplesPerSec=23.81669177131515, CurrSamplesPerSec=23.8066608162992, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:23:30,507] [INFO] [logging.py:96:log_dist] [Rank 0] step=9000, skipped=171, lr=[3.3369314101775284e-06, 3.3369314101775284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:23:30,749] [INFO] [timer.py:199:stop] epoch=9/micro_step=2880/global_step=9000, RunningAvgSamplesPerSec=23.816736094426677, CurrSamplesPerSec=23.78016774497491, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:23:57,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=9010, skipped=171, lr=[3.3271390958180995e-06, 3.3271390958180995e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:23:57,622] [INFO] [timer.py:199:stop] epoch=9/micro_step=2920/global_step=9010, RunningAvgSamplesPerSec=23.81677608341705, CurrSamplesPerSec=23.865918990352473, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:24:24,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=9020, skipped=171, lr=[3.317353604145638e-06, 3.317353604145638e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:24:24,500] [INFO] [timer.py:199:stop] epoch=9/micro_step=2960/global_step=9020, RunningAvgSamplesPerSec=23.81681518855763, CurrSamplesPerSec=23.855386490641198, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:24:51,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=9030, skipped=171, lr=[3.307574979732607e-06, 3.307574979732607e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:24:51,346] [INFO] [timer.py:199:stop] epoch=9/micro_step=3000/global_step=9030, RunningAvgSamplesPerSec=23.816884297989244, CurrSamplesPerSec=23.791884316518605, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:25:17,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=9040, skipped=171, lr=[3.297803267120187e-06, 3.297803267120187e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:25:18,204] [INFO] [timer.py:199:stop] epoch=9/micro_step=3040/global_step=9040, RunningAvgSamplesPerSec=23.816941088439783, CurrSamplesPerSec=23.852365892814127, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:25:44,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=9050, skipped=171, lr=[3.2880385108180727e-06, 3.2880385108180727e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:25:45,060] [INFO] [timer.py:199:stop] epoch=9/micro_step=3080/global_step=9050, RunningAvgSamplesPerSec=23.816997893292193, CurrSamplesPerSec=23.87502099016664, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:26:11,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=9060, skipped=171, lr=[3.278280755304281e-06, 3.278280755304281e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:26:11,910] [INFO] [timer.py:199:stop] epoch=9/micro_step=3120/global_step=9060, RunningAvgSamplesPerSec=23.817062400245643, CurrSamplesPerSec=23.87556036561427, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:26:38,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=9070, skipped=171, lr=[3.268530045024933e-06, 3.268530045024933e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:26:38,761] [INFO] [timer.py:199:stop] epoch=9/micro_step=3160/global_step=9070, RunningAvgSamplesPerSec=23.817127207114208, CurrSamplesPerSec=23.89226332916965, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:26:43,841] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:26:46,247] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:27:04,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=9080, skipped=173, lr=[3.260734579217568e-06, 3.260734579217568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:27:05,059] [INFO] [timer.py:199:stop] epoch=9/micro_step=3200/global_step=9080, RunningAvgSamplesPerSec=23.81773024536875, CurrSamplesPerSec=23.863061189021774, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:27:31,685] [INFO] [logging.py:96:log_dist] [Rank 0] step=9090, skipped=173, lr=[3.2509966622618906e-06, 3.2509966622618906e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:27:31,927] [INFO] [timer.py:199:stop] epoch=9/micro_step=3240/global_step=9090, RunningAvgSamplesPerSec=23.817776616593175, CurrSamplesPerSec=23.840873419907023, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:27:58,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=9100, skipped=173, lr=[3.2412659148184337e-06, 3.2412659148184337e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:27:58,807] [INFO] [timer.py:199:stop] epoch=9/micro_step=3280/global_step=9100, RunningAvgSamplesPerSec=23.81781015066118, CurrSamplesPerSec=23.85208612877839, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:28:25,402] [INFO] [logging.py:96:log_dist] [Rank 0] step=9110, skipped=173, lr=[3.2315423812103054e-06, 3.2315423812103054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:28:25,643] [INFO] [timer.py:199:stop] epoch=9/micro_step=3320/global_step=9110, RunningAvgSamplesPerSec=23.817886453997918, CurrSamplesPerSec=23.88488649415111, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
22:28:52,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=9120, skipped=173, lr=[3.2218261057277478e-06, 3.2218261057277478e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:28:52,515] [INFO] [timer.py:199:stop] epoch=9/micro_step=3360/global_step=9120, RunningAvgSamplesPerSec=23.81792664426119, CurrSamplesPerSec=23.842315461626715, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:29:19,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=9130, skipped=173, lr=[3.212117132627944e-06, 3.212117132627944e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:29:19,353] [INFO] [timer.py:199:stop] epoch=9/micro_step=3400/global_step=9130, RunningAvgSamplesPerSec=23.817998245788747, CurrSamplesPerSec=23.90751401018821, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:29:45,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=9140, skipped=173, lr=[3.2024155061348207e-06, 3.2024155061348207e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:29:46,191] [INFO] [timer.py:199:stop] epoch=9/micro_step=3440/global_step=9140, RunningAvgSamplesPerSec=23.818072739886187, CurrSamplesPerSec=23.873466705893193, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:30:12,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=9150, skipped=173, lr=[3.192721270438834e-06, 3.192721270438834e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:30:13,061] [INFO] [timer.py:199:stop] epoch=9/micro_step=3480/global_step=9150, RunningAvgSamplesPerSec=23.818114418081826, CurrSamplesPerSec=23.811277088957347, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:30:39,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=9160, skipped=173, lr=[3.183034469696777e-06, 3.183034469696777e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:30:39,922] [INFO] [timer.py:199:stop] epoch=9/micro_step=3520/global_step=9160, RunningAvgSamplesPerSec=23.818165148206962, CurrSamplesPerSec=23.862757839756256, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:31:06,554] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=9170, skipped=173, lr=[3.17335514803158e-06, 3.17335514803158e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:31:06,795] [INFO] [timer.py:199:stop] epoch=9/micro_step=3560/global_step=9170, RunningAvgSamplesPerSec=23.818203776133025, CurrSamplesPerSec=23.837212984450225, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:31:17,259] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:31:19,660] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:31:32,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=9180, skipped=175, lr=[3.1656171052636245e-06, 3.1656171052636245e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:31:33,108] [INFO] [timer.py:199:stop] epoch=9/micro_step=3600/global_step=9180, RunningAvgSamplesPerSec=23.818785502965152, CurrSamplesPerSec=23.82326748336075, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:31:59,713] [INFO] [logging.py:96:log_dist] [Rank 0] step=9190, skipped=175, lr=[3.1559513570176826e-06, 3.1559513570176826e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:31:59,954] [INFO] [timer.py:199:stop] epoch=9/micro_step=3640/global_step=9190, RunningAvgSamplesPerSec=23.818850717475716, CurrSamplesPerSec=23.874719460182423, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:32:26,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=9200, skipped=175, lr=[3.146293211210923e-06, 3.146293211210923e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:32:26,802] [INFO] [timer.py:199:stop] epoch=9/micro_step=3680/global_step=9200, RunningAvgSamplesPerSec=23.81891555593617, CurrSamplesPerSec=23.87026959348894, MemAllocated=11.44GB, MaxMemAllocated=31.66GB ***** Evaluating perplexity, Epoch 10/16 ***** ppl: 1.791977047920227 saving the final model 
... Beginning of Epoch 11/16, Total Micro Batches 3680 [2023-04-23 22:33:43,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=9210, skipped=175, lr=[3.1366427118357515e-06, 3.1366427118357515e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:33:44,196] [INFO] [timer.py:199:stop] epoch=10/micro_step=40/global_step=9210, RunningAvgSamplesPerSec=23.818926531186612, CurrSamplesPerSec=23.861807536739573, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:34:10,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=9220, skipped=175, lr=[3.1269999028497428e-06, 3.1269999028497428e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:34:11,064] [INFO] [timer.py:199:stop] epoch=10/micro_step=80/global_step=9220, RunningAvgSamplesPerSec=23.818969712946412, CurrSamplesPerSec=23.824865987297983, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:34:37,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=9230, skipped=175, lr=[3.1173648281754492e-06, 3.1173648281754492e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:34:37,949] [INFO] [timer.py:199:stop] epoch=10/micro_step=120/global_step=9230, RunningAvgSamplesPerSec=23.819000814095684, CurrSamplesPerSec=23.804116929816395, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:35:04,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=9240, skipped=175, lr=[3.1077375317001873e-06, 3.1077375317001873e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:35:04,833] [INFO] [timer.py:199:stop] epoch=10/micro_step=160/global_step=9240, RunningAvgSamplesPerSec=23.81903057479609, CurrSamplesPerSec=23.854434654813538, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:35:31,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=9250, skipped=175, lr=[3.0981180572758433e-06, 3.0981180572758433e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:35:31,714] [INFO] [timer.py:199:stop] epoch=10/micro_step=200/global_step=9250, RunningAvgSamplesPerSec=23.81906273440979, CurrSamplesPerSec=23.852605392997962, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:35:58,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=9260, skipped=175, lr=[3.088506448718682e-06, 3.088506448718682e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:35:58,562] [INFO] [timer.py:199:stop] epoch=10/micro_step=240/global_step=9260, RunningAvgSamplesPerSec=23.819124208550356, CurrSamplesPerSec=23.876297268312875, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:36:25,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=9270, skipped=175, lr=[3.078902749809133e-06, 3.078902749809133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:36:25,442] [INFO] [timer.py:199:stop] epoch=10/micro_step=280/global_step=9270, RunningAvgSamplesPerSec=23.819155253867322, CurrSamplesPerSec=23.84490352750476, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:36:41,267] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:36:43,662] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:36:51,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=9280, skipped=177, lr=[3.0712255150252274e-06, 3.0712255150252274e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:36:51,745] [INFO] [timer.py:199:stop] epoch=10/micro_step=320/global_step=9280, RunningAvgSamplesPerSec=23.81973804045485, CurrSamplesPerSec=23.806109770691595, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:37:18,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=9290, skipped=177, lr=[3.061636163692937e-06, 3.061636163692937e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:37:18,655] [INFO] [timer.py:199:stop] epoch=10/micro_step=360/global_step=9290, RunningAvgSamplesPerSec=23.819741144674072, CurrSamplesPerSec=23.834000179885642, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:37:45,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=9300, skipped=177, lr=[3.052054844401161e-06, 3.052054844401161e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:37:45,574] [INFO] [timer.py:199:stop] epoch=10/micro_step=400/global_step=9300, RunningAvgSamplesPerSec=23.81973464256713, CurrSamplesPerSec=23.806135105562362, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:38:12,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=9310, skipped=177, lr=[3.042481600792362e-06, 3.042481600792362e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:38:12,482] [INFO] [timer.py:199:stop] epoch=10/micro_step=440/global_step=9310, RunningAvgSamplesPerSec=23.819740702383566, CurrSamplesPerSec=23.882935689898797, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:38:39,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=9320, skipped=177, lr=[3.0329164764722254e-06, 3.0329164764722254e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:38:39,413] [INFO] [timer.py:199:stop] epoch=10/micro_step=480/global_step=9320, RunningAvgSamplesPerSec=23.81972585400348, CurrSamplesPerSec=23.74430373409468, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:39:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=9330, skipped=177, lr=[3.0233595150094465e-06, 3.0233595150094465e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:39:06,325] [INFO] [timer.py:199:stop] epoch=10/micro_step=520/global_step=9330, RunningAvgSamplesPerSec=23.81972766322398, CurrSamplesPerSec=23.83837302392536, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:39:33,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=9340, skipped=177, lr=[3.01381075993554e-06, 3.01381075993554e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:39:33,390] [INFO] [timer.py:199:stop] epoch=10/micro_step=560/global_step=9340, RunningAvgSamplesPerSec=23.819606650956665, CurrSamplesPerSec=23.841487483895357, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:40:00,050] [INFO] [logging.py:96:log_dist] [Rank 0] step=9350, skipped=177, lr=[3.004270254744646e-06, 3.004270254744646e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:40:00,291] [INFO] [timer.py:199:stop] epoch=10/micro_step=600/global_step=9350, RunningAvgSamplesPerSec=23.819621178552584, CurrSamplesPerSec=23.807433589891655, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:40:26,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=9360, skipped=177, lr=[2.994738042893322e-06, 2.994738042893322e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:40:27,178] [INFO] [timer.py:199:stop] epoch=10/micro_step=640/global_step=9360, RunningAvgSamplesPerSec=23.819646174869185, CurrSamplesPerSec=23.79012156174458, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:40:53,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=9370, skipped=177, lr=[2.985214167800349e-06, 2.985214167800349e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:40:54,080] [INFO] [timer.py:199:stop] epoch=10/micro_step=680/global_step=9370, RunningAvgSamplesPerSec=23.81965703896077, CurrSamplesPerSec=23.821817174663714, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-23 22:41:15,304] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:41:17,703] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:41:20,147] [INFO] [logging.py:96:log_dist] [Rank 0] step=9380, skipped=179, lr=[2.9776010993451697e-06, 2.9776010993451697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:41:20,388] [INFO] [timer.py:199:stop] epoch=10/micro_step=720/global_step=9380, RunningAvgSamplesPerSec=23.820228093740678, CurrSamplesPerSec=23.878410541263808, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:41:47,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=9390, skipped=179, lr=[2.9680923397112276e-06, 2.9680923397112276e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:41:47,281] [INFO] [timer.py:199:stop] epoch=10/micro_step=760/global_step=9390, RunningAvgSamplesPerSec=23.820245892450302, CurrSamplesPerSec=23.86759113210265, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:42:13,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=9400, skipped=179, lr=[2.9585920382055805e-06, 2.9585920382055805e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:42:14,149] [INFO] [timer.py:199:stop] epoch=10/micro_step=800/global_step=9400, RunningAvgSamplesPerSec=23.820284889772132, CurrSamplesPerSec=23.838417480239105, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:42:40,943] [INFO] [logging.py:96:log_dist] [Rank 0] step=9410, skipped=179, lr=[2.949100238101662e-06, 2.949100238101662e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:42:41,184] [INFO] [timer.py:199:stop] epoch=10/micro_step=840/global_step=9410, RunningAvgSamplesPerSec=23.820298050549393, CurrSamplesPerSec=23.83532498644789, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:43:07,827] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=9420, skipped=179, lr=[2.939616982634181e-06, 2.939616982634181e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:43:08,068] [INFO] [timer.py:199:stop] epoch=10/micro_step=880/global_step=9420, RunningAvgSamplesPerSec=23.820324915637016, CurrSamplesPerSec=23.861417255641673, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:43:34,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=9430, skipped=179, lr=[2.9301423149989243e-06, 2.9301423149989243e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:43:34,938] [INFO] [timer.py:199:stop] epoch=10/micro_step=920/global_step=9430, RunningAvgSamplesPerSec=23.82036777253752, CurrSamplesPerSec=23.82267549929824, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:44:01,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=9440, skipped=179, lr=[2.9206762783525677e-06, 2.9206762783525677e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:44:01,816] [INFO] [timer.py:199:stop] epoch=10/micro_step=960/global_step=9440, RunningAvgSamplesPerSec=23.820401951093213, CurrSamplesPerSec=23.82793884662444, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:44:28,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=9450, skipped=179, lr=[2.911218915812467e-06, 2.911218915812467e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:44:28,714] [INFO] [timer.py:199:stop] epoch=10/micro_step=1000/global_step=9450, RunningAvgSamplesPerSec=23.82041499793117, CurrSamplesPerSec=23.814878855588933, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:44:55,359] [INFO] [logging.py:96:log_dist] [Rank 0] step=9460, skipped=179, lr=[2.901770270456469e-06, 2.901770270456469e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:44:55,600] [INFO] [timer.py:199:stop] epoch=10/micro_step=1040/global_step=9460, RunningAvgSamplesPerSec=23.82043792388165, CurrSamplesPerSec=23.88052843724988, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:45:22,265] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=9470, skipped=179, lr=[2.8923303853227185e-06, 2.8923303853227185e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:45:22,508] [INFO] [timer.py:199:stop] epoch=10/micro_step=1080/global_step=9470, RunningAvgSamplesPerSec=23.820443201112578, CurrSamplesPerSec=23.754292707147638, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:45:49,111] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:45:49,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=9480, skipped=180, lr=[2.8838420142312345e-06, 2.8838420142312345e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:45:49,112] [INFO] [timer.py:199:stop] epoch=10/micro_step=1120/global_step=9480, RunningAvgSamplesPerSec=23.82073181179409, CurrSamplesPerSec=26.73526244006525, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:45:51,508] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:46:15,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=9490, skipped=181, lr=[2.8753608050668098e-06, 2.8753608050668098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:46:15,710] [INFO] [timer.py:199:stop] epoch=10/micro_step=1160/global_step=9490, RunningAvgSamplesPerSec=23.821023273514786, CurrSamplesPerSec=23.820965253493828, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:46:42,367] [INFO] [logging.py:96:log_dist] [Rank 0] step=9500, skipped=181, lr=[2.865945677177859e-06, 2.865945677177859e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:46:42,609] [INFO] [timer.py:199:stop] epoch=10/micro_step=1200/global_step=9500, RunningAvgSamplesPerSec=23.821035236729443, CurrSamplesPerSec=23.830682460901404, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:47:09,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=9510, skipped=181, lr=[2.8565394726905166e-06, 2.8565394726905166e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:47:09,487] [INFO] [timer.py:199:stop] epoch=10/micro_step=1240/global_step=9510, RunningAvgSamplesPerSec=23.8210663398584, CurrSamplesPerSec=23.807000746043364, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:47:36,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=9520, skipped=181, lr=[2.8471422344496033e-06, 2.8471422344496033e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:47:36,394] [INFO] [timer.py:199:stop] epoch=10/micro_step=1280/global_step=9520, RunningAvgSamplesPerSec=23.821071610771384, CurrSamplesPerSec=23.81398517784098, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:48:03,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=9530, skipped=181, lr=[2.837754005259109e-06, 2.837754005259109e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:48:03,260] [INFO] [timer.py:199:stop] epoch=10/micro_step=1320/global_step=9530, RunningAvgSamplesPerSec=23.82111404128867, CurrSamplesPerSec=23.85402977663172, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:48:29,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=9540, skipped=181, lr=[2.8283748278819783e-06, 2.8283748278819783e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:48:30,164] [INFO] [timer.py:199:stop] epoch=10/micro_step=1360/global_step=9540, RunningAvgSamplesPerSec=23.821136179657632, CurrSamplesPerSec=23.84982494372842, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:48:56,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=9550, skipped=181, lr=[2.8190047450399323e-06, 2.8190047450399323e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:48:57,027] [INFO] [timer.py:199:stop] epoch=10/micro_step=1400/global_step=9550, RunningAvgSamplesPerSec=23.82117993771119, CurrSamplesPerSec=23.82787962367575, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:49:23,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=9560, skipped=181, lr=[2.8096437994132626e-06, 2.8096437994132626e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:49:23,878] [INFO] [timer.py:199:stop] epoch=10/micro_step=1440/global_step=9560, RunningAvgSamplesPerSec=23.82123490867829, CurrSamplesPerSec=23.849922418027166, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:49:50,525] [INFO] [logging.py:96:log_dist] [Rank 0] step=9570, skipped=181, lr=[2.8002920336406414e-06, 2.8002920336406414e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:49:50,768] [INFO] [timer.py:199:stop] epoch=10/micro_step=1480/global_step=9570, RunningAvgSamplesPerSec=23.821255356009942, CurrSamplesPerSec=23.80012169795872, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:50:17,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=9580, skipped=181, lr=[2.79094949031893e-06, 2.79094949031893e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:50:17,649] [INFO] [timer.py:199:stop] epoch=10/micro_step=1520/global_step=9580, RunningAvgSamplesPerSec=23.82128598359356, CurrSamplesPerSec=23.882064518608647, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:50:22,731] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:50:25,133] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:50:43,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=9590, skipped=183, lr=[2.7834821244244923e-06, 2.7834821244244923e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:50:43,936] [INFO] [timer.py:199:stop] epoch=10/micro_step=1560/global_step=9590, RunningAvgSamplesPerSec=23.821868042315224, CurrSamplesPerSec=23.86601871812487, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:51:10,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=9600, skipped=183, lr=[2.77415628872429e-06, 2.77415628872429e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:51:10,803] [INFO] [timer.py:199:stop] epoch=10/micro_step=1600/global_step=9600, RunningAvgSamplesPerSec=23.821914336682784, CurrSamplesPerSec=23.829912408231667, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:51:37,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=9610, skipped=183, lr=[2.764839794522103e-06, 2.764839794522103e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:51:37,675] [INFO] [timer.py:199:stop] epoch=10/micro_step=1640/global_step=9610, RunningAvgSamplesPerSec=23.821956772349022, CurrSamplesPerSec=23.824999205642406, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:52:04,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=9620, skipped=183, lr=[2.755532684254133e-06, 2.755532684254133e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:52:04,578] [INFO] [timer.py:199:stop] epoch=10/micro_step=1680/global_step=9620, RunningAvgSamplesPerSec=23.82196673740961, CurrSamplesPerSec=23.815191553004958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
22:52:31,227] [INFO] [logging.py:96:log_dist] [Rank 0] step=9630, skipped=183, lr=[2.7462350003138303e-06, 2.7462350003138303e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:52:31,468] [INFO] [timer.py:199:stop] epoch=10/micro_step=1720/global_step=9630, RunningAvgSamplesPerSec=23.821990019110302, CurrSamplesPerSec=23.89145314275741, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:52:58,143] [INFO] [logging.py:96:log_dist] [Rank 0] step=9640, skipped=183, lr=[2.7369467850517175e-06, 2.7369467850517175e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:52:58,384] [INFO] [timer.py:199:stop] epoch=10/micro_step=1760/global_step=9640, RunningAvgSamplesPerSec=23.821984290325812, CurrSamplesPerSec=23.808140953959505, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:53:25,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=9650, skipped=183, lr=[2.727668080775184e-06, 2.727668080775184e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:53:25,264] [INFO] [timer.py:199:stop] epoch=10/micro_step=1800/global_step=9650, RunningAvgSamplesPerSec=23.822013080425627, CurrSamplesPerSec=23.886744091970684, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:53:51,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=9660, skipped=183, lr=[2.7183989297482968e-06, 2.7183989297482968e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:53:52,151] [INFO] [timer.py:199:stop] epoch=10/micro_step=1840/global_step=9660, RunningAvgSamplesPerSec=23.822034105524274, CurrSamplesPerSec=23.834736635642095, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:54:18,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=9670, skipped=183, lr=[2.7091393741916128e-06, 2.7091393741916128e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:54:19,044] [INFO] [timer.py:199:stop] epoch=10/micro_step=1880/global_step=9670, RunningAvgSamplesPerSec=23.822046975740726, CurrSamplesPerSec=23.896134269278136, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
22:54:45,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=9680, skipped=183, lr=[2.6998894562819735e-06, 2.6998894562819735e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:54:45,894] [INFO] [timer.py:199:stop] epoch=10/micro_step=1920/global_step=9680, RunningAvgSamplesPerSec=23.822103809654926, CurrSamplesPerSec=23.867544444762558, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:54:56,356] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:54:58,756] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:55:11,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=9690, skipped=185, lr=[2.6924964893749543e-06, 2.6924964893749543e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:55:12,191] [INFO] [timer.py:199:stop] epoch=10/micro_step=1960/global_step=9690, RunningAvgSamplesPerSec=23.822664293383728, CurrSamplesPerSec=23.866112081050133, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:55:38,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=9700, skipped=185, lr=[2.683264025375404e-06, 2.683264025375404e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:55:39,088] [INFO] [timer.py:199:stop] epoch=10/micro_step=2000/global_step=9700, RunningAvgSamplesPerSec=23.82267380586454, CurrSamplesPerSec=23.801130404366276, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:56:05,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=9710, skipped=185, lr=[2.6740413168839124e-06, 2.6740413168839124e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:56:05,997] [INFO] [timer.py:199:stop] epoch=10/micro_step=2040/global_step=9710, RunningAvgSamplesPerSec=23.822676388698337, CurrSamplesPerSec=23.816362128397596, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:56:32,923] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=9720, skipped=185, lr=[2.6648284059094918e-06, 2.6648284059094918e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:56:33,165] [INFO] [timer.py:199:stop] epoch=10/micro_step=2080/global_step=9720, RunningAvgSamplesPerSec=23.822505052834444, CurrSamplesPerSec=23.837291304729284, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:56:59,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=9730, skipped=185, lr=[2.6556253344165228e-06, 2.6556253344165228e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:57:00,040] [INFO] [timer.py:199:stop] epoch=10/micro_step=2120/global_step=9730, RunningAvgSamplesPerSec=23.82253786195968, CurrSamplesPerSec=23.83892768123429, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:57:26,691] [INFO] [logging.py:96:log_dist] [Rank 0] step=9740, skipped=185, lr=[2.6464321443245723e-06, 2.6464321443245723e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:57:26,933] [INFO] [timer.py:199:stop] epoch=10/micro_step=2160/global_step=9740, RunningAvgSamplesPerSec=23.82255519789243, CurrSamplesPerSec=23.645669753968647, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:57:53,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=9750, skipped=185, lr=[2.6372488775081915e-06, 2.6372488775081915e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:57:53,807] [INFO] [timer.py:199:stop] epoch=10/micro_step=2200/global_step=9750, RunningAvgSamplesPerSec=23.822588661581722, CurrSamplesPerSec=23.869753804984438, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:58:20,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=9760, skipped=185, lr=[2.628075575796736e-06, 2.628075575796736e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:58:20,719] [INFO] [timer.py:199:stop] epoch=10/micro_step=2240/global_step=9760, RunningAvgSamplesPerSec=23.822589297637297, CurrSamplesPerSec=23.780123505655627, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:58:47,371] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=9770, skipped=185, lr=[2.6189122809741714e-06, 2.6189122809741714e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:58:47,613] [INFO] [timer.py:199:stop] epoch=10/micro_step=2280/global_step=9770, RunningAvgSamplesPerSec=23.822605321752963, CurrSamplesPerSec=23.85043946893402, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:59:14,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=9780, skipped=185, lr=[2.6097590347788754e-06, 2.6097590347788754e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:59:14,479] [INFO] [timer.py:199:stop] epoch=10/micro_step=2320/global_step=9780, RunningAvgSamplesPerSec=23.82264405977876, CurrSamplesPerSec=23.832186745798488, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 22:59:30,340] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 22:59:32,739] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 22:59:40,570] [INFO] [logging.py:96:log_dist] [Rank 0] step=9790, skipped=187, lr=[2.602443700853256e-06, 2.602443700853256e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 22:59:40,811] [INFO] [timer.py:199:stop] epoch=10/micro_step=2360/global_step=9790, RunningAvgSamplesPerSec=23.82316770278983, CurrSamplesPerSec=23.77017384204114, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:00:07,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=9800, skipped=187, lr=[2.593308647221558e-06, 2.593308647221558e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:00:07,655] [INFO] [timer.py:199:stop] epoch=10/micro_step=2400/global_step=9800, RunningAvgSamplesPerSec=23.823225559035368, CurrSamplesPerSec=23.876860063256427, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:00:34,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=9810, skipped=187, lr=[2.5841837588404933e-06, 2.5841837588404933e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:00:34,624] [INFO] [timer.py:199:stop] epoch=10/micro_step=2440/global_step=9810, RunningAvgSamplesPerSec=23.82320149696972, CurrSamplesPerSec=23.927087286857333, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:01:01,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=9820, skipped=187, lr=[2.575069077273513e-06, 2.575069077273513e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:01:01,460] [INFO] [timer.py:199:stop] epoch=10/micro_step=2480/global_step=9820, RunningAvgSamplesPerSec=23.82326822350998, CurrSamplesPerSec=23.893615888758042, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:01:28,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=9830, skipped=187, lr=[2.5659646440375635e-06, 2.5659646440375635e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:01:28,391] [INFO] [timer.py:199:stop] epoch=10/micro_step=2520/global_step=9830, RunningAvgSamplesPerSec=23.82324779612205, CurrSamplesPerSec=23.661658297445936, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:01:55,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=9840, skipped=187, lr=[2.5568705006029174e-06, 2.5568705006029174e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:01:55,266] [INFO] [timer.py:199:stop] epoch=10/micro_step=2560/global_step=9840, RunningAvgSamplesPerSec=23.823280789500394, CurrSamplesPerSec=23.863489709518166, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:02:21,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=9850, skipped=187, lr=[2.5477866883929775e-06, 2.5477866883929775e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:02:22,156] [INFO] [timer.py:199:stop] epoch=10/micro_step=2600/global_step=9850, RunningAvgSamplesPerSec=23.823296118081096, CurrSamplesPerSec=23.812343774311508, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:02:49,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=9860, skipped=187, lr=[2.538713248784089e-06, 2.538713248784089e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:02:49,543] [INFO] [timer.py:199:stop] epoch=10/micro_step=2640/global_step=9860, RunningAvgSamplesPerSec=23.822897993227997, CurrSamplesPerSec=23.847458258024023, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:03:16,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=9870, skipped=187, lr=[2.529650223105343e-06, 2.529650223105343e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:03:16,395] [INFO] [timer.py:199:stop] epoch=10/micro_step=2680/global_step=9870, RunningAvgSamplesPerSec=23.82294974740927, CurrSamplesPerSec=23.844761614080692, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:03:43,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=9880, skipped=187, lr=[2.520597652638405e-06, 2.520597652638405e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:03:43,267] [INFO] [timer.py:199:stop] epoch=10/micro_step=2720/global_step=9880, RunningAvgSamplesPerSec=23.82298077566509, CurrSamplesPerSec=23.850636547516604, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:04:04,462] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:04:06,855] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:04:09,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=9890, skipped=189, lr=[2.513363151728236e-06, 2.513363151728236e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:04:09,547] [INFO] [timer.py:199:stop] epoch=10/micro_step=2760/global_step=9890, RunningAvgSamplesPerSec=23.823542195991525, CurrSamplesPerSec=23.8232505691221, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:04:36,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=9900, skipped=189, lr=[2.5043295045202063e-06, 2.5043295045202063e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:04:36,397] [INFO] [timer.py:199:stop] epoch=10/micro_step=2800/global_step=9900, RunningAvgSamplesPerSec=23.82359286027931, CurrSamplesPerSec=23.843249387876117, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:05:03,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=9910, skipped=189, lr=[2.4953064278586837e-06, 2.4953064278586837e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:05:03,260] [INFO] [timer.py:199:stop] epoch=10/micro_step=2840/global_step=9910, RunningAvgSamplesPerSec=23.823631204198943, CurrSamplesPerSec=23.873241648715187, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:05:29,894] [INFO] [logging.py:96:log_dist] [Rank 0] step=9920, skipped=189, lr=[2.4862939628433646e-06, 2.4862939628433646e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:05:30,136] [INFO] [timer.py:199:stop] epoch=10/micro_step=2880/global_step=9920, RunningAvgSamplesPerSec=23.823657870765395, CurrSamplesPerSec=23.85041403967673, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
23:05:56,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=9930, skipped=189, lr=[2.4772921505256056e-06, 2.4772921505256056e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:05:57,000] [INFO] [timer.py:199:stop] epoch=10/micro_step=2920/global_step=9930, RunningAvgSamplesPerSec=23.823694753249736, CurrSamplesPerSec=23.84665322582436, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:06:23,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=9940, skipped=189, lr=[2.46830103190825e-06, 2.46830103190825e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:06:23,885] [INFO] [timer.py:199:stop] epoch=10/micro_step=2960/global_step=9940, RunningAvgSamplesPerSec=23.823716304300632, CurrSamplesPerSec=23.754576488128528, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:06:50,581] [INFO] [logging.py:96:log_dist] [Rank 0] step=9950, skipped=189, lr=[2.4593206479454217e-06, 2.4593206479454217e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:06:50,823] [INFO] [timer.py:199:stop] epoch=10/micro_step=3000/global_step=9950, RunningAvgSamplesPerSec=23.823695492748076, CurrSamplesPerSec=23.81769131469156, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:07:17,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=9960, skipped=189, lr=[2.4503510395423546e-06, 2.4503510395423546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:07:17,756] [INFO] [timer.py:199:stop] epoch=10/micro_step=3040/global_step=9960, RunningAvgSamplesPerSec=23.82367777523339, CurrSamplesPerSec=23.79392994552761, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:07:44,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=9970, skipped=189, lr=[2.4413922475552013e-06, 2.4413922475552013e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:07:44,666] [INFO] [timer.py:199:stop] epoch=10/micro_step=3080/global_step=9970, RunningAvgSamplesPerSec=23.823677026355753, CurrSamplesPerSec=23.7945036283598, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:08:11,350] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=9980, skipped=189, lr=[2.432444312790838e-06, 2.432444312790838e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:08:11,591] [INFO] [timer.py:199:stop] epoch=10/micro_step=3120/global_step=9980, RunningAvgSamplesPerSec=23.82366342927478, CurrSamplesPerSec=23.910963911238618, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:08:38,202] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:08:38,203] [INFO] [logging.py:96:log_dist] [Rank 0] step=9990, skipped=190, lr=[2.424400488115442e-06, 2.424400488115442e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:08:38,204] [INFO] [timer.py:199:stop] epoch=10/micro_step=3160/global_step=9990, RunningAvgSamplesPerSec=23.823939360078842, CurrSamplesPerSec=26.69748632208053, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:08:40,606] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:09:04,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=10000, skipped=191, lr=[2.4163655204824553e-06, 2.4163655204824553e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:09:04,841] [INFO] [timer.py:199:stop] epoch=10/micro_step=3200/global_step=10000, RunningAvgSamplesPerSec=23.824179378747143, CurrSamplesPerSec=23.799767193836356, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:09:31,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=10010, skipped=191, lr=[2.4074482026128632e-06, 2.4074482026128632e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:09:31,737] [INFO] [timer.py:199:stop] epoch=10/micro_step=3240/global_step=10010, RunningAvgSamplesPerSec=23.824188681019958, CurrSamplesPerSec=23.786927761358243, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:09:58,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=10020, skipped=191, lr=[2.39854189657962e-06, 2.39854189657962e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:09:58,641] [INFO] [timer.py:199:stop] epoch=10/micro_step=3280/global_step=10020, RunningAvgSamplesPerSec=23.82419002658694, CurrSamplesPerSec=23.79993600400609, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:10:25,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=10030, skipped=191, lr=[2.3896466429505306e-06, 2.3896466429505306e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:10:25,540] [INFO] [timer.py:199:stop] epoch=10/micro_step=3320/global_step=10030, RunningAvgSamplesPerSec=23.82419604052315, CurrSamplesPerSec=23.781035711580518, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:10:52,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=10040, skipped=191, lr=[2.3807624822430644e-06, 2.3807624822430644e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:10:52,404] [INFO] [timer.py:199:stop] epoch=10/micro_step=3360/global_step=10040, RunningAvgSamplesPerSec=23.82423465362853, CurrSamplesPerSec=23.892948096489125, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:11:19,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=10050, skipped=191, lr=[2.3718894549241623e-06, 2.3718894549241623e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:11:19,311] [INFO] [timer.py:199:stop] epoch=10/micro_step=3400/global_step=10050, RunningAvgSamplesPerSec=23.824237059686535, CurrSamplesPerSec=23.819692770098627, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:11:45,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=10060, skipped=191, lr=[2.363027601410054e-06, 2.363027601410054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:11:46,215] [INFO] [timer.py:199:stop] epoch=10/micro_step=3440/global_step=10060, RunningAvgSamplesPerSec=23.824245791710354, CurrSamplesPerSec=23.841635710976647, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:12:12,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=10070, skipped=191, lr=[2.354176962066067e-06, 2.354176962066067e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:12:13,106] [INFO] [timer.py:199:stop] epoch=10/micro_step=3480/global_step=10070, RunningAvgSamplesPerSec=23.824262027334736, CurrSamplesPerSec=23.812797936686017, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:12:39,765] [INFO] [logging.py:96:log_dist] [Rank 0] step=10080, skipped=191, lr=[2.345337577206456e-06, 2.345337577206456e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:12:40,006] [INFO] [timer.py:199:stop] epoch=10/micro_step=3520/global_step=10080, RunningAvgSamplesPerSec=23.82426985243905, CurrSamplesPerSec=23.790541141165892, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:13:06,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=10090, skipped=191, lr=[2.3365094870942098e-06, 2.3365094870942098e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:13:06,888] [INFO] [timer.py:199:stop] epoch=10/micro_step=3560/global_step=10090, RunningAvgSamplesPerSec=23.824292854595104, CurrSamplesPerSec=23.902728388941835, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:13:11,971] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-23 23:13:14,364] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072
[2023-04-23 23:13:32,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=10100, skipped=193, lr=[2.3294551742464016e-06, 2.3294551742464016e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:13:33,177] [INFO] [timer.py:199:stop] epoch=10/micro_step=3600/global_step=10100, RunningAvgSamplesPerSec=23.824838495261513, CurrSamplesPerSec=23.894675075341958, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:13:59,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=10110, skipped=193, lr=[2.320647515977803e-06, 2.320647515977803e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:14:00,054] [INFO] [timer.py:199:stop] epoch=10/micro_step=3640/global_step=10110, RunningAvgSamplesPerSec=23.824862508349426, CurrSamplesPerSec=23.870598606030907, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:14:26,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=10120, skipped=193, lr=[2.311851264918654e-06, 2.311851264918654e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:14:26,967] [INFO] [timer.py:199:stop] epoch=10/micro_step=3680/global_step=10120, RunningAvgSamplesPerSec=23.82485703227083, CurrSamplesPerSec=23.777989680617146, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 11/16 *****
ppl: 1.781893253326416
saving the final model ...
Beginning of Epoch 12/16, Total Micro Batches 3680
[2023-04-23 23:15:45,406] [INFO] [logging.py:96:log_dist] [Rank 0] step=10130, skipped=193, lr=[2.303066461135465e-06, 2.303066461135465e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:15:45,648] [INFO] [timer.py:199:stop] epoch=11/micro_step=40/global_step=10130, RunningAvgSamplesPerSec=23.82484587368391, CurrSamplesPerSec=23.863563959683372, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:16:12,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=10140, skipped=193, lr=[2.2942931446426167e-06, 2.2942931446426167e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:16:12,523] [INFO] [timer.py:199:stop] epoch=11/micro_step=80/global_step=10140, RunningAvgSamplesPerSec=23.824874579051368, CurrSamplesPerSec=23.81362814803591, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:16:39,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=10150, skipped=193, lr=[2.285531355402154e-06, 2.285531355402154e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:16:39,442] [INFO] [timer.py:199:stop] epoch=11/micro_step=120/global_step=10150, RunningAvgSamplesPerSec=23.824866099402435, CurrSamplesPerSec=23.84584401139297, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:17:06,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=10160, skipped=193, lr=[2.2767811333236225e-06, 2.2767811333236225e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:17:06,342] [INFO] [timer.py:199:stop] epoch=11/micro_step=160/global_step=10160, RunningAvgSamplesPerSec=23.82487150498599, CurrSamplesPerSec=23.792620280671517, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:17:33,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=10170, skipped=193, lr=[2.2680425182638796e-06, 2.2680425182638796e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:17:33,603] [INFO] [timer.py:199:stop] epoch=11/micro_step=200/global_step=10170, RunningAvgSamplesPerSec=23.82463054170481, CurrSamplesPerSec=23.81442883926115, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:18:00,254] [INFO] [logging.py:96:log_dist] [Rank 0] step=10180, skipped=193, lr=[2.2593155500269077e-06, 2.2593155500269077e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:18:00,500] [INFO] [timer.py:199:stop] epoch=11/micro_step=240/global_step=10180, RunningAvgSamplesPerSec=23.824639214090897, CurrSamplesPerSec=23.830208576445848, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:18:27,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=10190, skipped=193, lr=[2.250600268363646e-06, 2.250600268363646e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:18:27,442] [INFO] [timer.py:199:stop] epoch=11/micro_step=280/global_step=10190, RunningAvgSamplesPerSec=23.824613818327155, CurrSamplesPerSec=23.249726154574493, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:18:37,897] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:18:40,297] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:18:53,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=10200, skipped=195, lr=[2.243636484044757e-06, 2.243636484044757e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:18:53,747] [INFO] [timer.py:199:stop] epoch=11/micro_step=320/global_step=10200, RunningAvgSamplesPerSec=23.8251367268619, CurrSamplesPerSec=23.863398488519167, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:19:20,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=10210, skipped=195, lr=[2.2349423382164974e-06, 2.2349423382164974e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:19:20,668] [INFO] [timer.py:199:stop] epoch=11/micro_step=360/global_step=10210, RunningAvgSamplesPerSec=23.825128203295833, CurrSamplesPerSec=23.882032647693396, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:19:47,334] [INFO] [logging.py:96:log_dist] [Rank 0] step=10220, skipped=195, lr=[2.226259989980796e-06, 2.226259989980796e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:19:47,575] [INFO] [timer.py:199:stop] epoch=11/micro_step=400/global_step=10220, RunningAvgSamplesPerSec=23.825130311278354, CurrSamplesPerSec=23.842628879966355, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:20:14,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=10230, skipped=195, lr=[2.217589478885348e-06, 2.217589478885348e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:20:14,493] [INFO] [timer.py:199:stop] epoch=11/micro_step=440/global_step=10230, RunningAvgSamplesPerSec=23.825120440749608, CurrSamplesPerSec=23.79033662070468, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:20:41,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=10240, skipped=195, lr=[2.208930844423929e-06, 2.208930844423929e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:20:41,390] [INFO] [timer.py:199:stop] epoch=11/micro_step=480/global_step=10240, RunningAvgSamplesPerSec=23.82512911971709, CurrSamplesPerSec=23.828846263612895, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:21:08,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=10250, skipped=195, lr=[2.2002841260362138e-06, 2.2002841260362138e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:21:08,250] [INFO] [timer.py:199:stop] epoch=11/micro_step=520/global_step=10250, RunningAvgSamplesPerSec=23.825169934937595, CurrSamplesPerSec=23.89261421435235, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:21:34,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=10260, skipped=195, lr=[2.19164936310761e-06, 2.19164936310761e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:21:35,159] [INFO] [timer.py:199:stop] epoch=11/micro_step=560/global_step=10260, RunningAvgSamplesPerSec=23.82516941950037, CurrSamplesPerSec=23.80712320758435, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:22:01,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=10270, skipped=195, lr=[2.1830265949690576e-06, 2.1830265949690576e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:22:02,106] [INFO] [timer.py:199:stop] epoch=11/micro_step=600/global_step=10270, RunningAvgSamplesPerSec=23.82513702992857, CurrSamplesPerSec=23.759864462743497, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:22:28,779] [INFO] [logging.py:96:log_dist] [Rank 0] step=10280, skipped=195, lr=[2.174415860896868e-06, 2.174415860896868e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:22:29,021] [INFO] [timer.py:199:stop] epoch=11/micro_step=640/global_step=10280, RunningAvgSamplesPerSec=23.825130825012682, CurrSamplesPerSec=23.8200520954884, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:22:55,697] [INFO] [logging.py:96:log_dist] [Rank 0] step=10290, skipped=195, lr=[2.1658172001125357e-06, 2.1658172001125357e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:22:55,939] [INFO] [timer.py:199:stop] epoch=11/micro_step=680/global_step=10290, RunningAvgSamplesPerSec=23.825124476361996, CurrSamplesPerSec=23.81753281886923, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:23:11,789] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:23:14,185] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:23:22,008] [INFO] [logging.py:96:log_dist] [Rank 0] step=10300, skipped=197, lr=[2.158946990574067e-06, 2.158946990574067e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:23:22,249] [INFO] [timer.py:199:stop] epoch=11/micro_step=720/global_step=10300, RunningAvgSamplesPerSec=23.825638892270295, CurrSamplesPerSec=23.786860310795777, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:23:48,868] [INFO] [logging.py:96:log_dist] [Rank 0] step=10310, skipped=197, lr=[2.150370160370387e-06, 2.150370160370387e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:23:49,108] [INFO] [timer.py:199:stop] epoch=11/micro_step=760/global_step=10310, RunningAvgSamplesPerSec=23.82567931478046, CurrSamplesPerSec=23.847242165096887, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:24:15,779] [INFO] [logging.py:96:log_dist] [Rank 0] step=10320, skipped=197, lr=[2.141805512981618e-06, 2.141805512981618e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:24:16,023] [INFO] [timer.py:199:stop] epoch=11/micro_step=800/global_step=10320, RunningAvgSamplesPerSec=23.825671035242973, CurrSamplesPerSec=23.75447768953618, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:24:42,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=10330, skipped=197, lr=[2.1332530874193264e-06, 2.1332530874193264e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:24:42,928] [INFO] [timer.py:199:stop] epoch=11/micro_step=840/global_step=10330, RunningAvgSamplesPerSec=23.825671357726215, CurrSamplesPerSec=23.904884661698837, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
23:25:09,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=10340, skipped=197, lr=[2.1247129226394157e-06, 2.1247129226394157e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:25:09,805] [INFO] [timer.py:199:stop] epoch=11/micro_step=880/global_step=10340, RunningAvgSamplesPerSec=23.825693546726473, CurrSamplesPerSec=23.89480694851812, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:25:36,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=10350, skipped=197, lr=[2.116185057541941e-06, 2.116185057541941e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:25:36,696] [INFO] [timer.py:199:stop] epoch=11/micro_step=920/global_step=10350, RunningAvgSamplesPerSec=23.825703714947846, CurrSamplesPerSec=23.770759009712048, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:26:03,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=10360, skipped=197, lr=[2.107669530970934e-06, 2.107669530970934e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:26:03,568] [INFO] [timer.py:199:stop] epoch=11/micro_step=960/global_step=10360, RunningAvgSamplesPerSec=23.82573223104131, CurrSamplesPerSec=23.892261202624205, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:26:30,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=10370, skipped=197, lr=[2.0991663817142195e-06, 2.0991663817142195e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:26:30,490] [INFO] [timer.py:199:stop] epoch=11/micro_step=1000/global_step=10370, RunningAvgSamplesPerSec=23.82571770471672, CurrSamplesPerSec=23.80562630724091, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:26:57,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=10380, skipped=197, lr=[2.09067564850325e-06, 2.09067564850325e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:26:57,410] [INFO] [timer.py:199:stop] epoch=11/micro_step=1040/global_step=10380, RunningAvgSamplesPerSec=23.825706242561317, CurrSamplesPerSec=23.746255062662726, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 
23:27:24,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=10390, skipped=197, lr=[2.082197370012922e-06, 2.082197370012922e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:27:24,314] [INFO] [timer.py:199:stop] epoch=11/micro_step=1080/global_step=10390, RunningAvgSamplesPerSec=23.82570828225606, CurrSamplesPerSec=23.78606568765613, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:27:45,568] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:27:47,973] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:27:50,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=10400, skipped=199, lr=[2.07542374057284e-06, 2.07542374057284e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:27:50,666] [INFO] [timer.py:199:stop] epoch=11/micro_step=1120/global_step=10400, RunningAvgSamplesPerSec=23.826183057108693, CurrSamplesPerSec=23.814016867394205, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:28:17,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=10410, skipped=199, lr=[2.066967977859208e-06, 2.066967977859208e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:28:17,549] [INFO] [timer.py:199:stop] epoch=11/micro_step=1160/global_step=10410, RunningAvgSamplesPerSec=23.82620089002596, CurrSamplesPerSec=23.80227427114498, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:28:44,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=10420, skipped=199, lr=[2.058524777853557e-06, 2.058524777853557e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:28:44,418] [INFO] [timer.py:199:stop] epoch=11/micro_step=1200/global_step=10420, RunningAvgSamplesPerSec=23.82623091941205, CurrSamplesPerSec=23.884259566967234, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:29:11,067] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=10430, skipped=199, lr=[2.0500941790142736e-06, 2.0500941790142736e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:29:11,309] [INFO] [timer.py:199:stop] epoch=11/micro_step=1240/global_step=10430, RunningAvgSamplesPerSec=23.82624403221109, CurrSamplesPerSec=23.768309075417136, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:29:37,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=10440, skipped=199, lr=[2.0416762197423444e-06, 2.0416762197423444e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:29:38,212] [INFO] [timer.py:199:stop] epoch=11/micro_step=1280/global_step=10440, RunningAvgSamplesPerSec=23.8262459045934, CurrSamplesPerSec=23.822639558356073, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:30:04,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=10450, skipped=199, lr=[2.033270938381179e-06, 2.033270938381179e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:30:05,116] [INFO] [timer.py:199:stop] epoch=11/micro_step=1320/global_step=10450, RunningAvgSamplesPerSec=23.826248333265998, CurrSamplesPerSec=23.74662688069188, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:30:31,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=10460, skipped=199, lr=[2.0248783732164517e-06, 2.0248783732164517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:30:32,047] [INFO] [timer.py:199:stop] epoch=11/micro_step=1360/global_step=10460, RunningAvgSamplesPerSec=23.826228926917203, CurrSamplesPerSec=23.741403589643536, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:30:58,681] [INFO] [logging.py:96:log_dist] [Rank 0] step=10470, skipped=199, lr=[2.016498562475902e-06, 2.016498562475902e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:30:58,922] [INFO] [timer.py:199:stop] epoch=11/micro_step=1400/global_step=10470, RunningAvgSamplesPerSec=23.826253153661124, CurrSamplesPerSec=23.820863788035805, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:31:25,611] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=10480, skipped=199, lr=[2.008131544329181e-06, 2.008131544329181e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:31:25,854] [INFO] [timer.py:199:stop] epoch=11/micro_step=1440/global_step=10480, RunningAvgSamplesPerSec=23.82623022102428, CurrSamplesPerSec=23.724873074553077, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:31:52,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=10490, skipped=199, lr=[1.9997773568876697e-06, 1.9997773568876697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:31:52,801] [INFO] [timer.py:199:stop] epoch=11/micro_step=1480/global_step=10490, RunningAvgSamplesPerSec=23.8261978213997, CurrSamplesPerSec=23.79751380742819, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:32:19,434] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:32:19,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=10500, skipped=200, lr=[1.9922695898952237e-06, 1.9922695898952237e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:32:19,435] [INFO] [timer.py:199:stop] epoch=11/micro_step=1520/global_step=10500, RunningAvgSamplesPerSec=23.826430125073667, CurrSamplesPerSec=26.66192192167734, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:32:21,832] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:32:45,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=10510, skipped=201, lr=[1.984772274297629e-06, 1.984772274297629e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:32:46,096] [INFO] [timer.py:199:stop] epoch=11/micro_step=1560/global_step=10510, RunningAvgSamplesPerSec=23.826639983908066, CurrSamplesPerSec=23.822531736180252, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:33:12,742] [INFO] [logging.py:96:log_dist] [Rank 0] step=10520, skipped=201, lr=[1.976454215085109e-06, 1.976454215085109e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:33:12,985] [INFO] [timer.py:199:stop] epoch=11/micro_step=1600/global_step=10520, RunningAvgSamplesPerSec=23.826653613854614, CurrSamplesPerSec=23.80299188207051, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:33:39,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=10530, skipped=201, lr=[1.9681491308665617e-06, 1.9681491308665617e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:33:39,865] [INFO] [timer.py:199:stop] epoch=11/micro_step=1640/global_step=10530, RunningAvgSamplesPerSec=23.82667374196775, CurrSamplesPerSec=23.883503047813992, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:34:06,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=10540, skipped=201, lr=[1.959857059471263e-06, 1.959857059471263e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:34:06,834] [INFO] [timer.py:199:stop] epoch=11/micro_step=1680/global_step=10540, RunningAvgSamplesPerSec=23.82662177556911, CurrSamplesPerSec=23.817097494530955, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:34:33,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=10550, skipped=201, lr=[1.951578038669213e-06, 1.951578038669213e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:34:33,731] [INFO] [timer.py:199:stop] epoch=11/micro_step=1720/global_step=10550, RunningAvgSamplesPerSec=23.826628993842338, CurrSamplesPerSec=23.822821378353346, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:35:00,413] [INFO] [logging.py:96:log_dist] [Rank 0] step=10560, skipped=201, lr=[1.943312106170969e-06, 1.943312106170969e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:35:00,656] [INFO] [timer.py:199:stop] epoch=11/micro_step=1760/global_step=10560, RunningAvgSamplesPerSec=23.826613774497073, CurrSamplesPerSec=23.790167946681613, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:35:27,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=10570, skipped=201, lr=[1.9350592996274678e-06, 1.9350592996274678e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:35:27,579] [INFO] [timer.py:199:stop] epoch=11/micro_step=1800/global_step=10570, RunningAvgSamplesPerSec=23.826600249176053, CurrSamplesPerSec=23.815654275733767, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:35:54,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=10580, skipped=201, lr=[1.926819656629861e-06, 1.926819656629861e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:35:54,460] [INFO] [timer.py:199:stop] epoch=11/micro_step=1840/global_step=10580, RunningAvgSamplesPerSec=23.826620046804432, CurrSamplesPerSec=23.81610856439774, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:36:21,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=10590, skipped=201, lr=[1.9185932147093413e-06, 1.9185932147093413e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:36:21,340] [INFO] [timer.py:199:stop] epoch=11/micro_step=1880/global_step=10590, RunningAvgSamplesPerSec=23.82664087209953, CurrSamplesPerSec=23.83477896206855, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:36:48,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=10600, skipped=201, lr=[1.91038001133697e-06, 1.91038001133697e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:36:48,245] [INFO] [timer.py:199:stop] epoch=11/micro_step=1920/global_step=10600, RunningAvgSamplesPerSec=23.826643297488722, CurrSamplesPerSec=23.849649068205164, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:36:53,334] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:36:55,732] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:37:14,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=10610, skipped=203, lr=[1.903819005535802e-06, 1.903819005535802e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:37:14,563] [INFO] [timer.py:199:stop] epoch=11/micro_step=1960/global_step=10610, RunningAvgSamplesPerSec=23.827133991295156, CurrSamplesPerSec=23.758916028458785, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:37:41,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=10620, skipped=203, lr=[1.8956297257845855e-06, 1.8956297257845855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:37:41,785] [INFO] [timer.py:199:stop] epoch=11/micro_step=2000/global_step=10620, RunningAvgSamplesPerSec=23.827016377800277, CurrSamplesPerSec=23.801284461197326, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:38:08,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=10630, skipped=203, lr=[1.8874537891791408e-06, 1.8874537891791408e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:38:08,658] [INFO] [timer.py:199:stop] epoch=11/micro_step=2040/global_step=10630, RunningAvgSamplesPerSec=23.827043477824855, CurrSamplesPerSec=23.834400145899824, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:38:35,276] [INFO] [logging.py:96:log_dist] [Rank 0] step=10640, skipped=203, lr=[1.8792912329604874e-06, 1.8792912329604874e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:38:35,516] [INFO] [timer.py:199:stop] epoch=11/micro_step=2080/global_step=10640, RunningAvgSamplesPerSec=23.827083669798512, CurrSamplesPerSec=23.88602779863179, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 23:39:02,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=10650, skipped=203, lr=[1.8711420943086855e-06, 1.8711420943086855e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:39:02,417] [INFO] [timer.py:199:stop] epoch=11/micro_step=2120/global_step=10650, RunningAvgSamplesPerSec=23.827090610615866, CurrSamplesPerSec=23.820726388272522, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:39:29,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=10660, skipped=203, lr=[1.8630064103426853e-06, 1.8630064103426853e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:39:29,336] [INFO] [timer.py:199:stop] epoch=11/micro_step=2160/global_step=10660, RunningAvgSamplesPerSec=23.82707973826309, CurrSamplesPerSec=23.78769293045872, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:39:56,012] [INFO] [logging.py:96:log_dist] [Rank 0] step=10670, skipped=203, lr=[1.8548842181201499e-06, 1.8548842181201499e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:39:56,254] [INFO] [timer.py:199:stop] epoch=11/micro_step=2200/global_step=10670, RunningAvgSamplesPerSec=23.827069903139694, CurrSamplesPerSec=23.8267692478934, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:40:22,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=10680, skipped=203, lr=[1.8467755546372901e-06, 1.8467755546372901e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:40:23,151] [INFO] [timer.py:199:stop] epoch=11/micro_step=2240/global_step=10680, RunningAvgSamplesPerSec=23.82707713166856, CurrSamplesPerSec=23.847409530845535, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:40:49,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=10690, skipped=203, lr=[1.8386804568286889e-06, 1.8386804568286889e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:40:50,209] [INFO] [timer.py:199:stop] epoch=11/micro_step=2280/global_step=10690, RunningAvgSamplesPerSec=23.82697433568222, CurrSamplesPerSec=23.82928201754452, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-23 23:41:16,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=10700, skipped=203, lr=[1.830598961567143e-06, 1.830598961567143e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:41:17,073] [INFO] [timer.py:199:stop] epoch=11/micro_step=2320/global_step=10700, RunningAvgSamplesPerSec=23.8270070600111, CurrSamplesPerSec=23.81162137497321, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:41:27,546] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:41:29,950] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:41:43,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=10710, skipped=205, lr=[1.8241435839307546e-06, 1.8241435839307546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:41:43,423] [INFO] [timer.py:199:stop] epoch=11/micro_step=2360/global_step=10710, RunningAvgSamplesPerSec=23.827467667854144, CurrSamplesPerSec=23.78007294663514, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:42:10,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=10720, skipped=205, lr=[1.8160866659754722e-06, 1.8160866659754722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:42:10,321] [INFO] [timer.py:199:stop] epoch=11/micro_step=2400/global_step=10720, RunningAvgSamplesPerSec=23.827474038732745, CurrSamplesPerSec=23.86277693144065, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:42:37,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=10730, skipped=205, lr=[1.8080434534809147e-06, 1.8080434534809147e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:42:37,248] [INFO] [timer.py:199:stop] epoch=11/micro_step=2440/global_step=10730, RunningAvgSamplesPerSec=23.8274553776075, CurrSamplesPerSec=23.779139749988815, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:43:03,937] [INFO] [logging.py:96:log_dist] 
[Rank 0] step=10740, skipped=205, lr=[1.8000139830835436e-06, 1.8000139830835436e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:43:04,180] [INFO] [timer.py:199:stop] epoch=11/micro_step=2480/global_step=10740, RunningAvgSamplesPerSec=23.82743388383718, CurrSamplesPerSec=23.86397340460761, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:43:30,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=10750, skipped=205, lr=[1.79199829135722e-06, 1.79199829135722e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:43:31,111] [INFO] [timer.py:199:stop] epoch=11/micro_step=2520/global_step=10750, RunningAvgSamplesPerSec=23.82741270835168, CurrSamplesPerSec=23.752209752091183, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:43:57,796] [INFO] [logging.py:96:log_dist] [Rank 0] step=10760, skipped=205, lr=[1.783996414813054e-06, 1.783996414813054e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:43:58,038] [INFO] [timer.py:199:stop] epoch=11/micro_step=2560/global_step=10760, RunningAvgSamplesPerSec=23.827394686794445, CurrSamplesPerSec=23.758007621383488, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:44:24,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=10770, skipped=205, lr=[1.7760083898992177e-06, 1.7760083898992177e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:44:24,948] [INFO] [timer.py:199:stop] epoch=11/micro_step=2600/global_step=10770, RunningAvgSamplesPerSec=23.82739042333403, CurrSamplesPerSec=23.812652180361805, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:44:51,658] [INFO] [logging.py:96:log_dist] [Rank 0] step=10780, skipped=205, lr=[1.7680342530007955e-06, 1.7680342530007955e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:44:51,901] [INFO] [timer.py:199:stop] epoch=11/micro_step=2640/global_step=10780, RunningAvgSamplesPerSec=23.82735096165294, CurrSamplesPerSec=23.75047388500768, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:45:18,618] [INFO] [logging.py:96:log_dist] [Rank 
0] step=10790, skipped=205, lr=[1.760074040439612e-06, 1.760074040439612e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:45:18,860] [INFO] [timer.py:199:stop] epoch=11/micro_step=2680/global_step=10790, RunningAvgSamplesPerSec=23.827310096996015, CurrSamplesPerSec=23.812424043610367, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:45:45,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=10800, skipped=205, lr=[1.7521277884740652e-06, 1.7521277884740652e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:45:45,896] [INFO] [timer.py:199:stop] epoch=11/micro_step=2720/global_step=10800, RunningAvgSamplesPerSec=23.827221484145387, CurrSamplesPerSec=23.82314908419457, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:46:01,763] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:46:04,161] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:46:11,982] [INFO] [logging.py:96:log_dist] [Rank 0] step=10810, skipped=207, lr=[1.7457808628555402e-06, 1.7457808628555402e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:46:12,223] [INFO] [timer.py:199:stop] epoch=11/micro_step=2760/global_step=10810, RunningAvgSamplesPerSec=23.82769659299637, CurrSamplesPerSec=23.870435159957033, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:46:38,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=10820, skipped=207, lr=[1.7378598311302241e-06, 1.7378598311302241e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:46:39,116] [INFO] [timer.py:199:stop] epoch=11/micro_step=2800/global_step=10820, RunningAvgSamplesPerSec=23.827708726654983, CurrSamplesPerSec=23.80109663875195, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:47:05,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=10830, skipped=207, lr=[1.7299528611852372e-06, 1.7299528611852372e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:47:06,069] [INFO] [timer.py:199:stop] epoch=11/micro_step=2840/global_step=10830, RunningAvgSamplesPerSec=23.827674484958333, CurrSamplesPerSec=23.72710223837055, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:47:32,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=10840, skipped=207, lr=[1.722059989036462e-06, 1.722059989036462e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:47:32,965] [INFO] [timer.py:199:stop] epoch=11/micro_step=2880/global_step=10840, RunningAvgSamplesPerSec=23.827682932703755, CurrSamplesPerSec=23.84708115722665, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:47:59,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=10850, skipped=207, lr=[1.7141812506355663e-06, 1.7141812506355663e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:47:59,869] [INFO] [timer.py:199:stop] epoch=11/micro_step=2920/global_step=10850, RunningAvgSamplesPerSec=23.827682687612132, 
CurrSamplesPerSec=23.815383823736205, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:48:26,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=10860, skipped=207, lr=[1.7063166818698375e-06, 1.7063166818698375e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:48:26,751] [INFO] [timer.py:199:stop] epoch=11/micro_step=2960/global_step=10860, RunningAvgSamplesPerSec=23.827699024368844, CurrSamplesPerSec=23.868083482444693, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:48:53,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=10870, skipped=207, lr=[1.6984663185620213e-06, 1.6984663185620213e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:48:53,586] [INFO] [timer.py:199:stop] epoch=11/micro_step=3000/global_step=10870, RunningAvgSamplesPerSec=23.827752269955113, CurrSamplesPerSec=23.852170905065883, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:49:20,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=10880, skipped=207, lr=[1.6906301964701611e-06, 1.6906301964701611e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:49:20,491] [INFO] [timer.py:199:stop] epoch=11/micro_step=3040/global_step=10880, RunningAvgSamplesPerSec=23.82775104911334, CurrSamplesPerSec=23.827875393476393, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:49:47,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=10890, skipped=207, lr=[1.682808351287426e-06, 1.682808351287426e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:49:47,373] [INFO] [timer.py:199:stop] epoch=11/micro_step=3080/global_step=10890, RunningAvgSamplesPerSec=23.827770258857758, CurrSamplesPerSec=23.821982069670447, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:50:14,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=10900, skipped=207, lr=[1.6750008186419596e-06, 1.6750008186419596e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:50:14,243] [INFO] [timer.py:199:stop] epoch=11/micro_step=3120/global_step=10900, 
RunningAvgSamplesPerSec=23.82779953660717, CurrSamplesPerSec=23.830075299838448, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:50:35,465] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:50:37,866] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-23 23:50:40,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=10910, skipped=209, lr=[1.6687651214529172e-06, 1.6687651214529172e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:50:40,563] [INFO] [timer.py:199:stop] epoch=11/micro_step=3160/global_step=10910, RunningAvgSamplesPerSec=23.82827570892168, CurrSamplesPerSec=23.785698956420678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:51:07,184] [INFO] [logging.py:96:log_dist] [Rank 0] step=10920, skipped=209, lr=[1.6609834409492537e-06, 1.6609834409492537e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:51:07,427] [INFO] [timer.py:199:stop] epoch=11/micro_step=3200/global_step=10920, RunningAvgSamplesPerSec=23.82830546303531, CurrSamplesPerSec=23.794689237551854, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:51:34,075] [INFO] [logging.py:96:log_dist] [Rank 0] step=10930, skipped=209, lr=[1.6532161723943139e-06, 1.6532161723943139e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:51:34,318] [INFO] [timer.py:199:stop] epoch=11/micro_step=3240/global_step=10930, RunningAvgSamplesPerSec=23.828315304123734, CurrSamplesPerSec=23.781528711118373, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:52:00,958] [INFO] [logging.py:96:log_dist] [Rank 0] step=10940, skipped=209, lr=[1.645463351167647e-06, 1.645463351167647e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:52:01,201] [INFO] [timer.py:199:stop] epoch=11/micro_step=3280/global_step=10940, RunningAvgSamplesPerSec=23.82833202834316, 
CurrSamplesPerSec=23.808442916198697, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:52:27,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=10950, skipped=209, lr=[1.63772501258299e-06, 1.63772501258299e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:52:28,095] [INFO] [timer.py:199:stop] epoch=11/micro_step=3320/global_step=10950, RunningAvgSamplesPerSec=23.82834836770012, CurrSamplesPerSec=23.818433103178045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:52:54,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=10960, skipped=209, lr=[1.6300011918881226e-06, 1.6300011918881226e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:52:54,997] [INFO] [timer.py:199:stop] epoch=11/micro_step=3360/global_step=10960, RunningAvgSamplesPerSec=23.82835290669061, CurrSamplesPerSec=23.8708681908094, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:53:21,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=10970, skipped=209, lr=[1.6222919242646851e-06, 1.6222919242646851e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:53:21,891] [INFO] [timer.py:199:stop] epoch=11/micro_step=3400/global_step=10970, RunningAvgSamplesPerSec=23.828362756804065, CurrSamplesPerSec=23.810776519075908, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:53:49,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=10980, skipped=209, lr=[1.6145972448280358e-06, 1.6145972448280358e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:53:49,374] [INFO] [timer.py:199:stop] epoch=11/micro_step=3440/global_step=10980, RunningAvgSamplesPerSec=23.827941707168293, CurrSamplesPerSec=23.76709271492278, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:54:16,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=10990, skipped=209, lr=[1.6069171886270823e-06, 1.6069171886270823e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:54:16,282] [INFO] [timer.py:199:stop] epoch=11/micro_step=3480/global_step=10990, RunningAvgSamplesPerSec=23.82794227164881, 
CurrSamplesPerSec=23.784426023104704, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:54:42,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=11000, skipped=209, lr=[1.5992517906441265e-06, 1.5992517906441265e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:54:43,232] [INFO] [timer.py:199:stop] epoch=11/micro_step=3520/global_step=11000, RunningAvgSamplesPerSec=23.827910993277886, CurrSamplesPerSec=23.856179391836648, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:55:09,845] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-23 23:55:09,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=11010, skipped=210, lr=[1.5923654940949206e-06, 1.5923654940949206e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-23 23:55:09,846] [INFO] [timer.py:199:stop] epoch=11/micro_step=3560/global_step=11010, RunningAvgSamplesPerSec=23.828148285530112, CurrSamplesPerSec=26.637448993393217, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-23 23:55:12,244] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072
[2023-04-23 23:55:36,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=11020, skipped=211, lr=[1.5854911243918115e-06, 1.5854911243918115e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:55:36,461] [INFO] [timer.py:199:stop] epoch=11/micro_step=3600/global_step=11020, RunningAvgSamplesPerSec=23.828383346611407, CurrSamplesPerSec=23.82687499350928, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:56:03,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=11030, skipped=211, lr=[1.5778669549561445e-06, 1.5778669549561445e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:56:03,298] [INFO] [timer.py:199:stop] epoch=11/micro_step=3640/global_step=11030, RunningAvgSamplesPerSec=23.82843329677652, CurrSamplesPerSec=23.938306540358543, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:56:30,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=11040, skipped=211, lr=[1.5702575760609407e-06, 1.5702575760609407e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:56:30,244] [INFO] [timer.py:199:stop] epoch=11/micro_step=3680/global_step=11040, RunningAvgSamplesPerSec=23.82840852080162, CurrSamplesPerSec=23.840540991566236, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 12/16 *****
ppl: 1.782162070274353
saving the final model ...
Beginning of Epoch 13/16, Total Micro Batches 3680
[2023-04-23 23:57:48,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=11050, skipped=211, lr=[1.562663022366569e-06, 1.562663022366569e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:57:48,958] [INFO] [timer.py:199:stop] epoch=12/micro_step=40/global_step=11050, RunningAvgSamplesPerSec=23.82836339830882, CurrSamplesPerSec=23.88896338618884, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:58:15,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=11060, skipped=211, lr=[1.5550833284658698e-06, 1.5550833284658698e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:58:15,827] [INFO] [timer.py:199:stop] epoch=12/micro_step=80/global_step=11060, RunningAvgSamplesPerSec=23.828390311929176, CurrSamplesPerSec=23.93313515914993, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:58:42,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=11070, skipped=211, lr=[1.5475185288839937e-06, 1.5475185288839937e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:58:42,583] [INFO] [timer.py:199:stop] epoch=12/micro_step=120/global_step=11070, RunningAvgSamplesPerSec=23.82850893511444, CurrSamplesPerSec=23.91221847393565, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:59:09,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=11080, skipped=211, lr=[1.5399686580782573e-06, 1.5399686580782573e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:59:09,332] [INFO] [timer.py:199:stop] epoch=12/micro_step=160/global_step=11080, RunningAvgSamplesPerSec=23.82863052465923, CurrSamplesPerSec=23.971560839000972, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-23 23:59:35,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=11090, skipped=211, lr=[1.5324337504379679e-06, 1.5324337504379679e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-23 23:59:36,095] [INFO] [timer.py:199:stop] epoch=12/micro_step=200/global_step=11090, RunningAvgSamplesPerSec=23.82874214474197, CurrSamplesPerSec=23.97228655315389,
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:00:02,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=11100, skipped=211, lr=[1.5249138402842819e-06, 1.5249138402842819e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:00:02,866] [INFO] [timer.py:199:stop] epoch=12/micro_step=240/global_step=11100, RunningAvgSamplesPerSec=23.828847914012844, CurrSamplesPerSec=23.958888265203072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:00:29,432] [INFO] [logging.py:96:log_dist] [Rank 0] step=11110, skipped=211, lr=[1.5174089618700437e-06, 1.5174089618700437e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:00:29,672] [INFO] [timer.py:199:stop] epoch=12/micro_step=280/global_step=11110, RunningAvgSamplesPerSec=23.828931039691593, CurrSamplesPerSec=23.9602312695951, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:00:34,742] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:00:37,130] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:00:55,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=11120, skipped=213, lr=[1.5114159049652562e-06, 1.5114159049652562e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:00:55,882] [INFO] [timer.py:199:stop] epoch=12/micro_step=320/global_step=11120, RunningAvgSamplesPerSec=23.82948709724179, CurrSamplesPerSec=23.896900098334683, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:01:22,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=11130, skipped=213, lr=[1.5039381697805262e-06, 1.5039381697805262e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:01:22,666] [INFO] [timer.py:199:stop] epoch=12/micro_step=360/global_step=11130, RunningAvgSamplesPerSec=23.829580114023386, CurrSamplesPerSec=23.90237294993435, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:01:50,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=11140, skipped=213, lr=[1.4964755618784517e-06, 1.4964755618784517e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:01:50,371] [INFO] [timer.py:199:stop] epoch=12/micro_step=400/global_step=11140, RunningAvgSamplesPerSec=23.829065873870483, CurrSamplesPerSec=23.908597853319996, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:02:16,901] [INFO] [logging.py:96:log_dist] [Rank 0] step=11150, skipped=213, lr=[1.4890281152508603e-06, 1.4890281152508603e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:02:17,142] [INFO] [timer.py:199:stop] epoch=12/micro_step=440/global_step=11150, RunningAvgSamplesPerSec=23.829171363179075, CurrSamplesPerSec=23.985416934166135, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:02:43,678] [INFO] [logging.py:96:log_dist] [Rank 0] step=11160, skipped=213, lr=[1.4815958638205285e-06, 1.4815958638205285e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:02:43,919] [INFO] [timer.py:199:stop] epoch=12/micro_step=480/global_step=11160, RunningAvgSamplesPerSec=23.82927584193269, CurrSamplesPerSec=23.95248326514529, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:03:10,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=11170, skipped=213, lr=[1.4741788414410168e-06, 1.4741788414410168e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:03:10,701] [INFO] [timer.py:199:stop] epoch=12/micro_step=520/global_step=11170, RunningAvgSamplesPerSec=23.82937339612409, CurrSamplesPerSec=23.917421308356058, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:03:37,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=11180, skipped=213, lr=[1.4667770818965194e-06, 1.4667770818965194e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:03:37,514] [INFO] [timer.py:199:stop] epoch=12/micro_step=560/global_step=11180, RunningAvgSamplesPerSec=23.829449132993048, CurrSamplesPerSec=23.917824079127247, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:04:04,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=11190, skipped=213, lr=[1.4593906189017028e-06, 1.4593906189017028e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:04:04,281] [INFO] [timer.py:199:stop] epoch=12/micro_step=600/global_step=11190, RunningAvgSamplesPerSec=23.829557670110685, CurrSamplesPerSec=23.937424921185915, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:04:30,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=11200, skipped=213, lr=[1.452019486101571e-06, 1.452019486101571e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:04:31,070] [INFO] [timer.py:199:stop] epoch=12/micro_step=640/global_step=11200, RunningAvgSamplesPerSec=23.8296480935931, CurrSamplesPerSec=23.88101282331746, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:04:57,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=11210, skipped=213, lr=[1.4446637170712862e-06, 1.4446637170712862e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:04:57,835] [INFO] [timer.py:199:stop] epoch=12/micro_step=680/global_step=11210, RunningAvgSamplesPerSec=23.829756722420953, CurrSamplesPerSec=23.92254965724156, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:05:08,252] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:05:10,642] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:05:23,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=11220, skipped=215, lr=[1.4387901862791912e-06, 1.4387901862791912e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:05:24,038] [INFO] [timer.py:199:stop] epoch=12/micro_step=720/global_step=11220, RunningAvgSamplesPerSec=23.83030944847321, CurrSamplesPerSec=23.902836938290744, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:05:50,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=11230, skipped=215, lr=[1.431462156420581e-06, 1.431462156420581e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:05:50,818] [INFO] [timer.py:199:stop] epoch=12/micro_step=760/global_step=11230, RunningAvgSamplesPerSec=23.830409013430167, CurrSamplesPerSec=24.001346902138494, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:06:17,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=11240, skipped=215, lr=[1.4241495839695046e-06, 1.4241495839695046e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:06:17,586] [INFO] [timer.py:199:stop] epoch=12/micro_step=800/global_step=11240, RunningAvgSamplesPerSec=23.83051386093251, CurrSamplesPerSec=23.993222720475654, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:06:44,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=11250, skipped=215, lr=[1.4168525022343865e-06, 1.4168525022343865e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:06:44,369] [INFO] [timer.py:199:stop] epoch=12/micro_step=840/global_step=11250, RunningAvgSamplesPerSec=23.830605325534467, CurrSamplesPerSec=23.991475035046253, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 00:07:10,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=11260, skipped=215, lr=[1.409570944453101e-06, 1.409570944453101e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:07:11,132] [INFO] [timer.py:199:stop] epoch=12/micro_step=880/global_step=11260, RunningAvgSamplesPerSec=23.83071270357579, CurrSamplesPerSec=23.952705544220766, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:07:37,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=11270, skipped=215, lr=[1.4023049437928008e-06, 1.4023049437928008e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:07:37,895] [INFO] [timer.py:199:stop] epoch=12/micro_step=920/global_step=11270, RunningAvgSamplesPerSec=23.83082502126975, CurrSamplesPerSec=23.925261792820695, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:08:04,395] [INFO] [logging.py:96:log_dist] [Rank 0] step=11280, skipped=215, lr=[1.395054533349788e-06, 1.395054533349788e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:08:04,635] [INFO] [timer.py:199:stop] epoch=12/micro_step=960/global_step=11280, RunningAvgSamplesPerSec=23.830951545716257, CurrSamplesPerSec=23.98428539888577, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:08:31,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=11290, skipped=215, lr=[1.3878197461493411e-06, 1.3878197461493411e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:08:31,405] [INFO] [timer.py:199:stop] epoch=12/micro_step=1000/global_step=11290, RunningAvgSamplesPerSec=23.831053382412602, CurrSamplesPerSec=23.908774599338212, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:08:57,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=11300, skipped=215, lr=[1.3806006151455816e-06, 1.3806006151455816e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:08:58,173] [INFO] [timer.py:199:stop] epoch=12/micro_step=1040/global_step=11300, RunningAvgSamplesPerSec=23.831158242813295, CurrSamplesPerSec=23.95130781939923, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 00:09:24,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=11310, skipped=215, lr=[1.3733971732213192e-06, 1.3733971732213192e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:09:24,919] [INFO] [timer.py:199:stop] epoch=12/micro_step=1080/global_step=11310, RunningAvgSamplesPerSec=23.83127885323208, CurrSamplesPerSec=23.92824329277598, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:09:40,682] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:09:43,072] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:09:50,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=11320, skipped=217, lr=[1.3676457378707728e-06, 1.3676457378707728e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:09:51,101] [INFO] [timer.py:199:stop] epoch=12/micro_step=1120/global_step=11320, RunningAvgSamplesPerSec=23.831839663595883, CurrSamplesPerSec=23.90722017574117, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:10:17,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=11330, skipped=217, lr=[1.360470618926066e-06, 1.360470618926066e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:10:17,871] [INFO] [timer.py:199:stop] epoch=12/micro_step=1160/global_step=11330, RunningAvgSamplesPerSec=23.83193835150394, CurrSamplesPerSec=23.900738492254924, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:10:44,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=11340, skipped=217, lr=[1.3533112807520563e-06, 1.3533112807520563e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:10:44,671] [INFO] [timer.py:199:stop] epoch=12/micro_step=1200/global_step=11340, RunningAvgSamplesPerSec=23.832015268117324, CurrSamplesPerSec=23.903937386367776, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:11:11,192] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=11350, skipped=217, lr=[1.346167755959193e-06, 1.346167755959193e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:11:11,434] [INFO] [timer.py:199:stop] epoch=12/micro_step=1240/global_step=11350, RunningAvgSamplesPerSec=23.83212104006266, CurrSamplesPerSec=23.908768210882396, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:11:37,960] [INFO] [logging.py:96:log_dist] [Rank 0] step=11360, skipped=217, lr=[1.3390400770859014e-06, 1.3390400770859014e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:11:38,202] [INFO] [timer.py:199:stop] epoch=12/micro_step=1280/global_step=11360, RunningAvgSamplesPerSec=23.832223570248157, CurrSamplesPerSec=23.95358401439738, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:12:04,729] [INFO] [logging.py:96:log_dist] [Rank 0] step=11370, skipped=217, lr=[1.3319282765984288e-06, 1.3319282765984288e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:12:04,971] [INFO] [timer.py:199:stop] epoch=12/micro_step=1320/global_step=11370, RunningAvgSamplesPerSec=23.83232555351538, CurrSamplesPerSec=23.90474416218883, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:12:31,476] [INFO] [logging.py:96:log_dist] [Rank 0] step=11380, skipped=217, lr=[1.3248323868906977e-06, 1.3248323868906977e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:12:31,718] [INFO] [timer.py:199:stop] epoch=12/micro_step=1360/global_step=11380, RunningAvgSamplesPerSec=23.832442348649206, CurrSamplesPerSec=23.951294996990868, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:12:58,254] [INFO] [logging.py:96:log_dist] [Rank 0] step=11390, skipped=217, lr=[1.317752440284152e-06, 1.317752440284152e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:12:58,496] [INFO] [timer.py:199:stop] epoch=12/micro_step=1400/global_step=11390, RunningAvgSamplesPerSec=23.832533883107523, CurrSamplesPerSec=23.940699834069342, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:13:25,007] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=11400, skipped=217, lr=[1.310688469027627e-06, 1.310688469027627e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:13:25,250] [INFO] [timer.py:199:stop] epoch=12/micro_step=1440/global_step=11400, RunningAvgSamplesPerSec=23.832646461465743, CurrSamplesPerSec=23.91383958944185, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:13:51,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=11410, skipped=217, lr=[1.3036405052971792e-06, 1.3036405052971792e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:13:52,021] [INFO] [timer.py:199:stop] epoch=12/micro_step=1480/global_step=11410, RunningAvgSamplesPerSec=23.832744889053846, CurrSamplesPerSec=23.93939104235678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:14:13,168] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:14:15,561] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:14:18,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=11420, skipped=219, lr=[1.2980136813073676e-06, 1.2980136813073676e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:14:18,241] [INFO] [timer.py:199:stop] epoch=12/micro_step=1520/global_step=11420, RunningAvgSamplesPerSec=23.833272719791488, CurrSamplesPerSec=23.932482225747826, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:14:44,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=11430, skipped=219, lr=[1.2909946119747033e-06, 1.2909946119747033e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:14:45,002] [INFO] [timer.py:199:stop] epoch=12/micro_step=1560/global_step=11430, RunningAvgSamplesPerSec=23.833380368635726, CurrSamplesPerSec=23.960308261670917, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:15:11,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=11440, skipped=219, lr=[1.2839916398727251e-06, 1.2839916398727251e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:15:11,781] [INFO] [timer.py:199:stop] epoch=12/micro_step=1600/global_step=11440, RunningAvgSamplesPerSec=23.833470088181155, CurrSamplesPerSec=23.958682978434947, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:15:38,323] [INFO] [logging.py:96:log_dist] [Rank 0] step=11450, skipped=219, lr=[1.277004796899642e-06, 1.277004796899642e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:15:38,563] [INFO] [timer.py:199:stop] epoch=12/micro_step=1640/global_step=11450, RunningAvgSamplesPerSec=23.833565868744728, CurrSamplesPerSec=23.968384483538824, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:16:05,082] [INFO] [logging.py:96:log_dist] [Rank 0] step=11460, skipped=219, lr=[1.2700341148802053e-06, 1.2700341148802053e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:16:05,323] [INFO] [timer.py:199:stop] epoch=12/micro_step=1680/global_step=11460, RunningAvgSamplesPerSec=23.833670092747813, 
CurrSamplesPerSec=23.925242601022, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:16:31,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=11470, skipped=219, lr=[1.2630796255655441e-06, 1.2630796255655441e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:16:32,109] [INFO] [timer.py:199:stop] epoch=12/micro_step=1720/global_step=11470, RunningAvgSamplesPerSec=23.833762684019156, CurrSamplesPerSec=23.981214932999045, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:16:58,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=11480, skipped=219, lr=[1.2561413606330367e-06, 1.2561413606330367e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:16:58,886] [INFO] [timer.py:199:stop] epoch=12/micro_step=1760/global_step=11480, RunningAvgSamplesPerSec=23.833855896754677, CurrSamplesPerSec=23.947602714642855, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:17:25,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=11490, skipped=219, lr=[1.2492193516861573e-06, 1.2492193516861573e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:17:25,673] [INFO] [timer.py:199:stop] epoch=12/micro_step=1800/global_step=11490, RunningAvgSamplesPerSec=23.833940887750973, CurrSamplesPerSec=23.837945404994848, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:17:52,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=11500, skipped=219, lr=[1.2423136302543378e-06, 1.2423136302543378e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:17:52,459] [INFO] [timer.py:199:stop] epoch=12/micro_step=1840/global_step=11500, RunningAvgSamplesPerSec=23.83403052856056, CurrSamplesPerSec=23.94702376205155, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:18:18,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=11510, skipped=219, lr=[1.235424227792816e-06, 1.235424227792816e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:18:19,220] [INFO] [timer.py:199:stop] epoch=12/micro_step=1880/global_step=11510, RunningAvgSamplesPerSec=23.834136516385296, 
CurrSamplesPerSec=23.907824886379355, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:18:45,695] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:18:45,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=11520, skipped=220, lr=[1.229237744234911e-06, 1.229237744234911e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:18:45,696] [INFO] [timer.py:199:stop] epoch=12/micro_step=1920/global_step=11520, RunningAvgSamplesPerSec=23.83446310648148, CurrSamplesPerSec=26.7767910598877, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:18:48,079] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:19:11,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=11530, skipped=221, lr=[1.223064527287551e-06, 1.223064527287551e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:19:12,175] [INFO] [timer.py:199:stop] epoch=12/micro_step=1960/global_step=11530, RunningAvgSamplesPerSec=23.834783550867293, CurrSamplesPerSec=23.89755747409118, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:19:38,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=11540, skipped=221, lr=[1.2162209846514856e-06, 1.2162209846514856e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:19:38,950] [INFO] [timer.py:199:stop] epoch=12/micro_step=2000/global_step=11540, RunningAvgSamplesPerSec=23.83487665587191, CurrSamplesPerSec=23.90759918094335, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:20:05,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=11550, skipped=221, lr=[1.2093938798365108e-06, 1.2093938798365108e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:20:05,734] [INFO] [timer.py:199:stop] epoch=12/micro_step=2040/global_step=11550, RunningAvgSamplesPerSec=23.83496308450598, CurrSamplesPerSec=23.875938367656538, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-24 00:20:32,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=11560, skipped=221, lr=[1.2025832439397735e-06, 1.2025832439397735e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:20:32,519] [INFO] [timer.py:199:stop] epoch=12/micro_step=2080/global_step=11560, RunningAvgSamplesPerSec=23.835049175789752, CurrSamplesPerSec=23.939882089358164, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:20:59,062] [INFO] [logging.py:96:log_dist] [Rank 0] step=11570, skipped=221, lr=[1.1957891079834e-06, 1.1957891079834e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:20:59,304] [INFO] [timer.py:199:stop] epoch=12/micro_step=2120/global_step=11570, RunningAvgSamplesPerSec=23.83513280615442, CurrSamplesPerSec=23.939066535393746, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:21:25,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=11580, skipped=221, lr=[1.1890115029143701e-06, 1.1890115029143701e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:21:26,095] [INFO] [timer.py:199:stop] epoch=12/micro_step=2160/global_step=11580, RunningAvgSamplesPerSec=23.83521158475091, CurrSamplesPerSec=23.93182719437637, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:21:52,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=11590, skipped=221, lr=[1.1822504596043568e-06, 1.1822504596043568e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:21:52,883] [INFO] [timer.py:199:stop] epoch=12/micro_step=2200/global_step=11590, RunningAvgSamplesPerSec=23.835296949733202, CurrSamplesPerSec=23.934748441482142, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:22:19,399] [INFO] [logging.py:96:log_dist] [Rank 0] step=11600, skipped=221, lr=[1.1755060088495999e-06, 1.1755060088495999e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:22:19,640] [INFO] [timer.py:199:stop] epoch=12/micro_step=2240/global_step=11600, RunningAvgSamplesPerSec=23.835401941028245, CurrSamplesPerSec=23.972605538762263, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-24 00:22:46,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=11610, skipped=221, lr=[1.1687781813707604e-06, 1.1687781813707604e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:22:46,418] [INFO] [timer.py:199:stop] epoch=12/micro_step=2280/global_step=11610, RunningAvgSamplesPerSec=23.835491677728744, CurrSamplesPerSec=23.989468191387434, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:23:12,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=11620, skipped=221, lr=[1.1620670078127814e-06, 1.1620670078127814e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:23:13,163] [INFO] [timer.py:199:stop] epoch=12/micro_step=2320/global_step=11620, RunningAvgSamplesPerSec=23.835603729850757, CurrSamplesPerSec=23.9413638924158, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:23:18,219] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:23:20,609] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:23:39,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=11630, skipped=223, lr=[1.1567100803343919e-06, 1.1567100803343919e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:23:39,351] [INFO] [timer.py:199:stop] epoch=12/micro_step=2360/global_step=11630, RunningAvgSamplesPerSec=23.83614000079494, CurrSamplesPerSec=23.928955721055182, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:24:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=11640, skipped=223, lr=[1.1500289608169872e-06, 1.1500289608169872e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:24:06,113] [INFO] [timer.py:199:stop] epoch=12/micro_step=2400/global_step=11640, RunningAvgSamplesPerSec=23.83624207969904, CurrSamplesPerSec=23.94864319057997, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:24:32,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=11650, skipped=223, lr=[1.14336458062227e-06, 1.14336458062227e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:24:32,876] [INFO] [timer.py:199:stop] epoch=12/micro_step=2440/global_step=11650, RunningAvgSamplesPerSec=23.83634265042272, CurrSamplesPerSec=23.922479303416658, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:24:59,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=11660, skipped=223, lr=[1.136716970106186e-06, 1.136716970106186e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:24:59,653] [INFO] [timer.py:199:stop] epoch=12/micro_step=2480/global_step=11660, RunningAvgSamplesPerSec=23.836431934390593, CurrSamplesPerSec=23.952205422102306, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:25:26,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=11670, skipped=223, lr=[1.1300861595482913e-06, 1.1300861595482913e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:25:26,425] [INFO] [timer.py:199:stop] epoch=12/micro_step=2520/global_step=11670, RunningAvgSamplesPerSec=23.836524202297696, CurrSamplesPerSec=23.91798178143153, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:25:52,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=11680, skipped=223, lr=[1.123472179151622e-06, 1.123472179151622e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:25:53,193] [INFO] [timer.py:199:stop] epoch=12/micro_step=2560/global_step=11680, RunningAvgSamplesPerSec=23.83661969450113, CurrSamplesPerSec=23.9425917500481, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:26:19,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=11690, skipped=223, lr=[1.116875059042551e-06, 1.116875059042551e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:26:19,965] [INFO] [timer.py:199:stop] epoch=12/micro_step=2600/global_step=11690, RunningAvgSamplesPerSec=23.836710843065717, CurrSamplesPerSec=23.969931888487388, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:26:46,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=11700, skipped=223, lr=[1.1102948292706564e-06, 1.1102948292706564e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:26:46,747] [INFO] [timer.py:199:stop] epoch=12/micro_step=2640/global_step=11700, RunningAvgSamplesPerSec=23.83679393864309, CurrSamplesPerSec=23.904188566669227, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:27:13,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=11710, skipped=223, lr=[1.1037315198085772e-06, 1.1037315198085772e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:27:13,551] [INFO] [timer.py:199:stop] epoch=12/micro_step=2680/global_step=11710, RunningAvgSamplesPerSec=23.836862009627186, CurrSamplesPerSec=23.90683053685298, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:27:40,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=11720, skipped=223, lr=[1.097185160551884e-06, 1.097185160551884e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:27:40,320] [INFO] [timer.py:199:stop] epoch=12/micro_step=2720/global_step=11720, RunningAvgSamplesPerSec=23.836954043603807, CurrSamplesPerSec=23.984578987745, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:27:50,745] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:27:53,138] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:28:06,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=11730, skipped=225, lr=[1.0919602973349466e-06, 1.0919602973349466e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:28:06,538] [INFO] [timer.py:199:stop] epoch=12/micro_step=2760/global_step=11730, RunningAvgSamplesPerSec=23.837464125157943, CurrSamplesPerSec=23.908510546010188, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:28:33,091] [INFO] [logging.py:96:log_dist] [Rank 0] step=11740, skipped=225, lr=[1.0854445235382546e-06, 1.0854445235382546e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:28:33,333] [INFO] [timer.py:199:stop] epoch=12/micro_step=2800/global_step=11740, RunningAvgSamplesPerSec=23.837537890572957, CurrSamplesPerSec=23.936304312584678, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:28:59,870] [INFO] [logging.py:96:log_dist] [Rank 0] step=11750, skipped=225, lr=[1.0789457832433692e-06, 1.0789457832433692e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:29:00,112] [INFO] [timer.py:199:stop] epoch=12/micro_step=2840/global_step=11750, RunningAvgSamplesPerSec=23.837624565004084, CurrSamplesPerSec=23.93599909832767, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:29:26,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=11760, skipped=225, lr=[1.072464106051753e-06, 1.072464106051753e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:29:26,886] [INFO] [timer.py:199:stop] epoch=12/micro_step=2880/global_step=11760, RunningAvgSamplesPerSec=23.837712560814893, CurrSamplesPerSec=23.937740845045806, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 00:29:53,417] [INFO] [logging.py:96:log_dist] [Rank 0] step=11770, skipped=225, lr=[1.0659995214871404e-06, 1.0659995214871404e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:29:53,659] [INFO] [timer.py:199:stop] epoch=12/micro_step=2920/global_step=11770, RunningAvgSamplesPerSec=23.837803935945946, CurrSamplesPerSec=23.921347303313453, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:30:20,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=11780, skipped=225, lr=[1.0595520589954201e-06, 1.0595520589954201e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:30:20,447] [INFO] [timer.py:199:stop] epoch=12/micro_step=2960/global_step=11780, RunningAvgSamplesPerSec=23.837884023685273, CurrSamplesPerSec=23.981307056961214, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:30:46,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=11790, skipped=225, lr=[1.0531217479444813e-06, 1.0531217479444813e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:30:47,223] [INFO] [timer.py:199:stop] epoch=12/micro_step=3000/global_step=11790, RunningAvgSamplesPerSec=23.837969946583765, CurrSamplesPerSec=23.923001637136714, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:31:13,761] [INFO] [logging.py:96:log_dist] [Rank 0] step=11800, skipped=225, lr=[1.0467086176240936e-06, 1.0467086176240936e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:31:14,003] [INFO] [timer.py:199:stop] epoch=12/micro_step=3040/global_step=11800, RunningAvgSamplesPerSec=23.83805330501879, CurrSamplesPerSec=23.923926970005283, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:31:40,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=11810, skipped=225, lr=[1.0403126972457675e-06, 1.0403126972457675e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:31:40,763] [INFO] [timer.py:199:stop] epoch=12/micro_step=3080/global_step=11810, RunningAvgSamplesPerSec=23.83814945027178, CurrSamplesPerSec=23.946878494207922, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-24 00:32:07,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=11820, skipped=225, lr=[1.0339340159426263e-06, 1.0339340159426263e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:32:07,539] [INFO] [timer.py:199:stop] epoch=12/micro_step=3120/global_step=11820, RunningAvgSamplesPerSec=23.838234096192352, CurrSamplesPerSec=23.917389343002856, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:32:23,319] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:32:25,708] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:32:33,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=11830, skipped=227, lr=[1.0288435025615746e-06, 1.0288435025615746e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:32:33,730] [INFO] [timer.py:199:stop] epoch=12/micro_step=3160/global_step=11830, RunningAvgSamplesPerSec=23.83875832500395, CurrSamplesPerSec=23.9719161979633, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:33:00,298] [INFO] [logging.py:96:log_dist] [Rank 0] step=11840, skipped=227, lr=[1.0224959247584964e-06, 1.0224959247584964e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:33:00,539] [INFO] [timer.py:199:stop] epoch=12/micro_step=3200/global_step=11840, RunningAvgSamplesPerSec=23.838818765890316, CurrSamplesPerSec=23.90003838280792, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:33:27,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=11850, skipped=227, lr=[1.0161656671851728e-06, 1.0161656671851728e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:33:27,339] [INFO] [timer.py:199:stop] epoch=12/micro_step=3240/global_step=11850, RunningAvgSamplesPerSec=23.83888671719354, CurrSamplesPerSec=23.901832362368026, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:33:54,009] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=11860, skipped=227, lr=[1.0098527586756348e-06, 1.0098527586756348e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:33:54,251] [INFO] [timer.py:199:stop] epoch=12/micro_step=3280/global_step=11860, RunningAvgSamplesPerSec=23.838873358314185, CurrSamplesPerSec=23.785380710318062, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:34:20,924] [INFO] [logging.py:96:log_dist] [Rank 0] step=11870, skipped=227, lr=[1.0035572279848869e-06, 1.0035572279848869e-06], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:34:21,166] [INFO] [timer.py:199:stop] epoch=12/micro_step=3320/global_step=11870, RunningAvgSamplesPerSec=23.8388536476245, CurrSamplesPerSec=23.810235842763923, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:34:48,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=11880, skipped=227, lr=[9.972791037887828e-07, 9.972791037887828e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:34:48,364] [INFO] [timer.py:199:stop] epoch=12/micro_step=3360/global_step=11880, RunningAvgSamplesPerSec=23.83866990776447, CurrSamplesPerSec=23.904165151333274, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:35:15,034] [INFO] [logging.py:96:log_dist] [Rank 0] step=11890, skipped=227, lr=[9.91018414683881e-07, 9.91018414683881e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:35:15,275] [INFO] [timer.py:199:stop] epoch=12/micro_step=3400/global_step=11890, RunningAvgSamplesPerSec=23.83865658713427, CurrSamplesPerSec=23.806665038969395, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:35:41,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=11900, skipped=227, lr=[9.847751891873368e-07, 9.847751891873368e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:35:42,198] [INFO] [timer.py:199:stop] epoch=12/micro_step=3440/global_step=11900, RunningAvgSamplesPerSec=23.838635011608652, CurrSamplesPerSec=23.818135114216563, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:36:08,851] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=11910, skipped=227, lr=[9.785494557367486e-07, 9.785494557367486e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:36:09,093] [INFO] [timer.py:199:stop] epoch=12/micro_step=3480/global_step=11910, RunningAvgSamplesPerSec=23.838633372730555, CurrSamplesPerSec=23.9302655052809, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:36:35,783] [INFO] [logging.py:96:log_dist] [Rank 0] step=11920, skipped=227, lr=[9.723412426900448e-07, 9.723412426900448e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:36:36,024] [INFO] [timer.py:199:stop] epoch=12/micro_step=3520/global_step=11920, RunningAvgSamplesPerSec=23.838608174728122, CurrSamplesPerSec=23.81724753170474, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:36:57,295] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:36:59,704] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:37:02,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=11930, skipped=229, lr=[9.67387305949062e-07, 9.67387305949062e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:37:02,392] [INFO] [timer.py:199:stop] epoch=12/micro_step=3560/global_step=11930, RunningAvgSamplesPerSec=23.839000317728257, CurrSamplesPerSec=23.8656092030135, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:37:29,434] [INFO] [logging.py:96:log_dist] [Rank 0] step=11940, skipped=229, lr=[9.612107008365076e-07, 9.612107008365076e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:37:29,675] [INFO] [timer.py:199:stop] epoch=12/micro_step=3600/global_step=11940, RunningAvgSamplesPerSec=23.838801022091754, CurrSamplesPerSec=23.828234965783835, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:37:56,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=11950, skipped=229, lr=[9.550516951050626e-07, 9.550516951050626e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:37:56,597] [INFO] [timer.py:199:stop] epoch=12/micro_step=3640/global_step=11950, RunningAvgSamplesPerSec=23.83877951050344, CurrSamplesPerSec=23.84980375376891, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:38:23,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=11960, skipped=229, lr=[9.489103168087133e-07, 9.489103168087133e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:38:23,477] [INFO] [timer.py:199:stop] epoch=12/micro_step=3680/global_step=11960, RunningAvgSamplesPerSec=23.83879112940027, CurrSamplesPerSec=23.843397636866474, MemAllocated=11.44GB, MaxMemAllocated=31.66GB ***** Evaluating perplexity, Epoch 13/16 ***** ppl: 1.7864570617675781 saving the final model ... 
Beginning of Epoch 14/16, Total Micro Batches 3680 [2023-04-24 00:39:42,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=11970, skipped=229, lr=[9.427865939211512e-07, 9.427865939211512e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:39:42,373] [INFO] [timer.py:199:stop] epoch=13/micro_step=40/global_step=11970, RunningAvgSamplesPerSec=23.838749504955796, CurrSamplesPerSec=23.873001733943504, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:40:09,045] [INFO] [logging.py:96:log_dist] [Rank 0] step=11980, skipped=229, lr=[9.366805543356507e-07, 9.366805543356507e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:40:09,286] [INFO] [timer.py:199:stop] epoch=13/micro_step=80/global_step=11980, RunningAvgSamplesPerSec=23.838734480665497, CurrSamplesPerSec=23.740546912075885, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:40:35,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=11990, skipped=229, lr=[9.305922258649389e-07, 9.305922258649389e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:40:36,237] [INFO] [timer.py:199:stop] epoch=13/micro_step=120/global_step=11990, RunningAvgSamplesPerSec=23.8386915462291, CurrSamplesPerSec=23.807866449479622, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:41:02,962] [INFO] [logging.py:96:log_dist] [Rank 0] step=12000, skipped=229, lr=[9.245216362410713e-07, 9.245216362410713e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:41:03,205] [INFO] [timer.py:199:stop] epoch=13/micro_step=160/global_step=12000, RunningAvgSamplesPerSec=23.838636187891954, CurrSamplesPerSec=23.805476416447164, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:41:29,911] [INFO] [logging.py:96:log_dist] [Rank 0] step=12010, skipped=229, lr=[9.184688131152987e-07, 9.184688131152987e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:41:30,153] [INFO] [timer.py:199:stop] epoch=13/micro_step=200/global_step=12010, RunningAvgSamplesPerSec=23.838596502285732, CurrSamplesPerSec=23.91168383095105, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:41:56,826] [INFO] [logging.py:96:log_dist] [Rank 0] step=12020, skipped=229, lr=[9.124337840579539e-07, 9.124337840579539e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:41:57,067] [INFO] [timer.py:199:stop] epoch=13/micro_step=240/global_step=12020, RunningAvgSamplesPerSec=23.838577974446185, CurrSamplesPerSec=23.84267970533843, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:42:23,718] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:42:23,719] [INFO] [logging.py:96:log_dist] [Rank 0] step=12030, skipped=230, lr=[9.070174945564125e-07, 9.070174945564125e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:42:23,720] [INFO] [timer.py:199:stop] epoch=13/micro_step=280/global_step=12030, RunningAvgSamplesPerSec=23.83875291173475, CurrSamplesPerSec=26.63359562901603, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:42:26,125] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:42:50,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=12040, skipped=231, lr=[9.016156605011192e-07, 9.016156605011192e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:42:50,352] [INFO] [timer.py:199:stop] epoch=13/micro_step=320/global_step=12040, RunningAvgSamplesPerSec=23.838944330377732, CurrSamplesPerSec=23.830392627545006, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:43:17,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=12050, skipped=231, lr=[8.956306008191278e-07, 8.956306008191278e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:43:17,289] [INFO] [timer.py:199:stop] epoch=13/micro_step=360/global_step=12050, RunningAvgSamplesPerSec=23.838911358185218, CurrSamplesPerSec=23.758375601990377, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:43:43,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=12060, skipped=231, lr=[8.896634392325615e-07, 8.896634392325615e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:43:44,221] [INFO] [timer.py:199:stop] epoch=13/micro_step=400/global_step=12060, RunningAvgSamplesPerSec=23.838880897383607, CurrSamplesPerSec=23.821880595549995, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:44:10,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=12070, skipped=231, lr=[8.837142029215609e-07, 8.837142029215609e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:44:11,149] [INFO] [timer.py:199:stop] epoch=13/micro_step=440/global_step=12070, RunningAvgSamplesPerSec=23.838855187347328, CurrSamplesPerSec=23.754261176346077, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:44:37,827] [INFO] [logging.py:96:log_dist] [Rank 0] step=12080, skipped=231, lr=[8.777829189846264e-07, 8.777829189846264e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:44:38,070] [INFO] [timer.py:199:stop] epoch=13/micro_step=480/global_step=12080, RunningAvgSamplesPerSec=23.838834091445495, CurrSamplesPerSec=23.761917209508432, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:45:04,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=12090, skipped=231, lr=[8.718696144384784e-07, 8.718696144384784e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:45:04,964] [INFO] [timer.py:199:stop] epoch=13/micro_step=520/global_step=12090, RunningAvgSamplesPerSec=23.838832270743918, CurrSamplesPerSec=23.852325623344043, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:45:31,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=12100, skipped=231, lr=[8.659743162179453e-07, 8.659743162179453e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:45:32,177] [INFO] [timer.py:199:stop] epoch=13/micro_step=560/global_step=12100, RunningAvgSamplesPerSec=23.838642642331095, CurrSamplesPerSec=23.825466538916075, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:45:58,863] [INFO] [logging.py:96:log_dist] [Rank 0] step=12110, skipped=231, lr=[8.600970511758371e-07, 8.600970511758371e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:45:59,105] [INFO] [timer.py:199:stop] epoch=13/micro_step=600/global_step=12110, RunningAvgSamplesPerSec=23.83861694350752, CurrSamplesPerSec=23.747028121118618, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:46:25,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=12120, skipped=231, lr=[8.542378460828245e-07, 8.542378460828245e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:46:26,049] [INFO] [timer.py:199:stop] epoch=13/micro_step=640/global_step=12120, RunningAvgSamplesPerSec=23.838577815599326, CurrSamplesPerSec=23.776172123835174, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:46:52,750] [INFO] [logging.py:96:log_dist] [Rank 0] step=12130, skipped=231, lr=[8.483967276273139e-07, 8.483967276273139e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:46:52,992] [INFO] [timer.py:199:stop] epoch=13/micro_step=680/global_step=12130, RunningAvgSamplesPerSec=23.83854028615164, CurrSamplesPerSec=23.79560678007209, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:46:58,097] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:47:00,502] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:47:19,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=12140, skipped=233, lr=[8.437368731239274e-07, 8.437368731239274e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:47:19,387] [INFO] [timer.py:199:stop] epoch=13/micro_step=720/global_step=12140, RunningAvgSamplesPerSec=23.838904382588094, CurrSamplesPerSec=23.755356395227363, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:47:47,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=12150, skipped=233, lr=[8.37928377607662e-07, 8.37928377607662e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:47:47,433] [INFO] [timer.py:199:stop] epoch=13/micro_step=760/global_step=12150, RunningAvgSamplesPerSec=23.83816394617534, CurrSamplesPerSec=23.740311756499732, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:48:14,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=12160, skipped=233, lr=[8.321380430177733e-07, 8.321380430177733e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:48:14,497] [INFO] [timer.py:199:stop] epoch=13/micro_step=800/global_step=12160, RunningAvgSamplesPerSec=23.83804069926228, CurrSamplesPerSec=23.767126384126083, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:48:41,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=12170, skipped=233, lr=[8.263658957289644e-07, 8.263658957289644e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:48:41,581] [INFO] [timer.py:199:stop] epoch=13/micro_step=840/global_step=12170, RunningAvgSamplesPerSec=23.837902914748923, CurrSamplesPerSec=23.704115635373597, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
00:49:08,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=12180, skipped=233, lr=[8.206119620330992e-07, 8.206119620330992e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:49:08,601] [INFO] [timer.py:199:stop] epoch=13/micro_step=880/global_step=12180, RunningAvgSamplesPerSec=23.837812185761866, CurrSamplesPerSec=23.803521675546584, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:49:35,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=12190, skipped=233, lr=[8.148762681390793e-07, 8.148762681390793e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:49:35,559] [INFO] [timer.py:199:stop] epoch=13/micro_step=920/global_step=12190, RunningAvgSamplesPerSec=23.83776975031243, CurrSamplesPerSec=23.758733079781006, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:50:02,239] [INFO] [logging.py:96:log_dist] [Rank 0] step=12200, skipped=233, lr=[8.09158840172726e-07, 8.09158840172726e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:50:02,481] [INFO] [timer.py:199:stop] epoch=13/micro_step=960/global_step=12200, RunningAvgSamplesPerSec=23.837747549683538, CurrSamplesPerSec=23.81961667905407, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:50:29,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=12210, skipped=233, lr=[8.034597041766545e-07, 8.034597041766545e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:50:29,443] [INFO] [timer.py:199:stop] epoch=13/micro_step=1000/global_step=12210, RunningAvgSamplesPerSec=23.837697549324734, CurrSamplesPerSec=23.602591194072698, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:50:56,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=12220, skipped=233, lr=[7.977788861101696e-07, 7.977788861101696e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:50:56,368] [INFO] [timer.py:199:stop] epoch=13/micro_step=1040/global_step=12220, RunningAvgSamplesPerSec=23.837674825741374, CurrSamplesPerSec=23.815081685409545, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
00:51:23,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=12230, skipped=233, lr=[7.921164118491315e-07, 7.921164118491315e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:51:23,279] [INFO] [timer.py:199:stop] epoch=13/micro_step=1080/global_step=12230, RunningAvgSamplesPerSec=23.83766325795183, CurrSamplesPerSec=23.766404621477033, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:51:33,752] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:51:36,154] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:51:49,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=12240, skipped=235, lr=[7.875996573154646e-07, 7.875996573154646e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:51:49,624] [INFO] [timer.py:199:stop] epoch=13/micro_step=1120/global_step=12240, RunningAvgSamplesPerSec=23.838060287254294, CurrSamplesPerSec=23.81406545820629, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:52:16,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=12250, skipped=235, lr=[7.819702668446232e-07, 7.819702668446232e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:52:16,533] [INFO] [timer.py:199:stop] epoch=13/micro_step=1160/global_step=12250, RunningAvgSamplesPerSec=23.838047377426165, CurrSamplesPerSec=23.831321388529112, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:52:43,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=12260, skipped=235, lr=[7.763592921867577e-07, 7.763592921867577e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:52:43,463] [INFO] [timer.py:199:stop] epoch=13/micro_step=1200/global_step=12260, RunningAvgSamplesPerSec=23.838018886497444, CurrSamplesPerSec=23.743843779775393, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:53:10,137] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=12270, skipped=235, lr=[7.707667588995947e-07, 7.707667588995947e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:53:10,378] [INFO] [timer.py:199:stop] epoch=13/micro_step=1240/global_step=12270, RunningAvgSamplesPerSec=23.838001144075573, CurrSamplesPerSec=23.786727518618807, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:53:37,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=12280, skipped=235, lr=[7.651926924568684e-07, 7.651926924568684e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:53:37,295] [INFO] [timer.py:199:stop] epoch=13/micro_step=1280/global_step=12280, RunningAvgSamplesPerSec=23.83798318127115, CurrSamplesPerSec=23.801579918234108, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:54:03,987] [INFO] [logging.py:96:log_dist] [Rank 0] step=12290, skipped=235, lr=[7.596371182481895e-07, 7.596371182481895e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:54:04,229] [INFO] [timer.py:199:stop] epoch=13/micro_step=1320/global_step=12290, RunningAvgSamplesPerSec=23.83795259841753, CurrSamplesPerSec=23.814933788324204, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:54:30,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=12300, skipped=235, lr=[7.541000615789427e-07, 7.541000615789427e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:54:31,139] [INFO] [timer.py:199:stop] epoch=13/micro_step=1360/global_step=12300, RunningAvgSamplesPerSec=23.837938736061155, CurrSamplesPerSec=23.80584586899415, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:54:57,833] [INFO] [logging.py:96:log_dist] [Rank 0] step=12310, skipped=235, lr=[7.485815476701633e-07, 7.485815476701633e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:54:58,075] [INFO] [timer.py:199:stop] epoch=13/micro_step=1400/global_step=12310, RunningAvgSamplesPerSec=23.837906329841893, CurrSamplesPerSec=23.84459216742875, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:55:24,752] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=12320, skipped=235, lr=[7.430816016584282e-07, 7.430816016584282e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:55:24,993] [INFO] [timer.py:199:stop] epoch=13/micro_step=1440/global_step=12320, RunningAvgSamplesPerSec=23.83788688414827, CurrSamplesPerSec=23.79427794835675, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:55:51,735] [INFO] [logging.py:96:log_dist] [Rank 0] step=12330, skipped=235, lr=[7.37600248595733e-07, 7.37600248595733e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:55:51,977] [INFO] [timer.py:199:stop] epoch=13/micro_step=1480/global_step=12330, RunningAvgSamplesPerSec=23.837821312044685, CurrSamplesPerSec=23.686035943665033, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:56:07,948] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 00:56:10,355] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 00:56:18,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=12340, skipped=237, lr=[7.332285698497683e-07, 7.332285698497683e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:56:18,457] [INFO] [timer.py:199:stop] epoch=13/micro_step=1520/global_step=12340, RunningAvgSamplesPerSec=23.83811714108355, CurrSamplesPerSec=23.76354956457985, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:56:45,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=12350, skipped=237, lr=[7.277807469559854e-07, 7.277807469559854e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:56:45,500] [INFO] [timer.py:199:stop] epoch=13/micro_step=1560/global_step=12350, RunningAvgSamplesPerSec=23.838008666975362, CurrSamplesPerSec=23.59365006818315, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:57:12,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=12360, skipped=237, lr=[7.22351586705927e-07, 7.22351586705927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:57:12,452] [INFO] [timer.py:199:stop] epoch=13/micro_step=1600/global_step=12360, RunningAvgSamplesPerSec=23.83796447926553, CurrSamplesPerSec=23.782618017613824, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:57:39,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=12370, skipped=237, lr=[7.169411138291679e-07, 7.169411138291679e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:57:39,405] [INFO] [timer.py:199:stop] epoch=13/micro_step=1640/global_step=12370, RunningAvgSamplesPerSec=23.837921018367123, CurrSamplesPerSec=23.812487414491727, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:58:06,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=12380, skipped=237, lr=[7.115493529701618e-07, 7.115493529701618e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:58:06,329] [INFO] [timer.py:199:stop] epoch=13/micro_step=1680/global_step=12380, RunningAvgSamplesPerSec=23.83789781103705, CurrSamplesPerSec=23.828905491366758, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:58:32,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=12390, skipped=237, lr=[7.061763286881259e-07, 7.061763286881259e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:58:33,241] [INFO] [timer.py:199:stop] epoch=13/micro_step=1720/global_step=12390, RunningAvgSamplesPerSec=23.837883051989365, CurrSamplesPerSec=23.795883111338675, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:58:59,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=12400, skipped=237, lr=[7.008220654569418e-07, 7.008220654569418e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:59:00,152] [INFO] [timer.py:199:stop] epoch=13/micro_step=1760/global_step=12400, RunningAvgSamplesPerSec=23.83786978362327, CurrSamplesPerSec=23.8033971402985, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:59:26,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=12410, skipped=237, lr=[6.954865876650267e-07, 6.954865876650267e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:59:27,088] [INFO] [timer.py:199:stop] epoch=13/micro_step=1800/global_step=12410, RunningAvgSamplesPerSec=23.83784266406034, CurrSamplesPerSec=23.679869979046437, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 00:59:53,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=12420, skipped=237, lr=[6.901699196152353e-07, 6.901699196152353e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 00:59:54,010] [INFO] [timer.py:199:stop] epoch=13/micro_step=1840/global_step=12420, RunningAvgSamplesPerSec=23.837821297703318, CurrSamplesPerSec=23.838576254141785, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:00:20,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=12430, skipped=237, lr=[6.848720855247446e-07, 6.848720855247446e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:00:20,899] [INFO] [timer.py:199:stop] epoch=13/micro_step=1880/global_step=12430, RunningAvgSamplesPerSec=23.83782442637823, CurrSamplesPerSec=23.811084884530604, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:00:42,147] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:00:44,551] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:00:47,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=12440, skipped=239, lr=[6.806473949222267e-07, 6.806473949222267e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:00:47,246] [INFO] [timer.py:199:stop] epoch=13/micro_step=1920/global_step=12440, RunningAvgSamplesPerSec=23.83821502083796, CurrSamplesPerSec=23.802911676348696, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:01:13,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=12450, skipped=239, lr=[6.753835227118564e-07, 6.753835227118564e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:01:14,117] [INFO] [timer.py:199:stop] epoch=13/micro_step=1960/global_step=12450, RunningAvgSamplesPerSec=23.83822803713535, CurrSamplesPerSec=23.827090717476167, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:01:41,259] [INFO] [logging.py:96:log_dist] [Rank 0] step=12460, skipped=239, lr=[6.701385518121399e-07, 6.701385518121399e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:01:41,500] [INFO] [timer.py:199:stop] epoch=13/micro_step=2000/global_step=12460, RunningAvgSamplesPerSec=23.837886548261164, CurrSamplesPerSec=23.761464985856538, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:02:08,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=12470, skipped=239, lr=[6.649125061136744e-07, 6.649125061136744e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:02:08,397] [INFO] [timer.py:199:stop] epoch=13/micro_step=2040/global_step=12470, RunningAvgSamplesPerSec=23.837884398112507, CurrSamplesPerSec=23.801554593057816, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:02:35,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=12480, skipped=239, lr=[6.597054094208564e-07, 6.597054094208564e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:02:35,491] [INFO] [timer.py:199:stop] epoch=13/micro_step=2080/global_step=12480, RunningAvgSamplesPerSec=23.83774914088101, CurrSamplesPerSec=23.852567242204156, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:03:02,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=12490, skipped=239, lr=[6.545172854517707e-07, 6.545172854517707e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:03:02,434] [INFO] [timer.py:199:stop] epoch=13/micro_step=2120/global_step=12490, RunningAvgSamplesPerSec=23.837713927263977, CurrSamplesPerSec=23.842099461891653, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:03:29,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=12500, skipped=239, lr=[6.493481578380817e-07, 6.493481578380817e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:03:29,470] [INFO] [timer.py:199:stop] epoch=13/micro_step=2160/global_step=12500, RunningAvgSamplesPerSec=23.83761622795649, CurrSamplesPerSec=23.727427316066457, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:03:56,242] [INFO] [logging.py:96:log_dist] [Rank 0] step=12510, skipped=239, lr=[6.441980501249231e-07, 6.441980501249231e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:03:56,484] [INFO] [timer.py:199:stop] epoch=13/micro_step=2200/global_step=12510, RunningAvgSamplesPerSec=23.83753245408921, CurrSamplesPerSec=23.799142617029727, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:04:23,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=12520, skipped=239, lr=[6.390669857708016e-07, 6.390669857708016e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:04:23,480] [INFO] [timer.py:199:stop] epoch=13/micro_step=2240/global_step=12520, RunningAvgSamplesPerSec=23.83745905250674, CurrSamplesPerSec=23.793782311039035, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:04:50,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=12530, skipped=239, lr=[6.339549881474718e-07, 6.339549881474718e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:04:50,426] [INFO] [timer.py:199:stop] epoch=13/micro_step=2280/global_step=12530, RunningAvgSamplesPerSec=23.83742170534418, CurrSamplesPerSec=23.8825298436586, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:05:17,045] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:05:17,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=12540, skipped=240, lr=[6.293705115880828e-07, 6.293705115880828e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:05:17,046] [INFO] [timer.py:199:stop] epoch=13/micro_step=2320/global_step=12540, RunningAvgSamplesPerSec=23.837616191675217, CurrSamplesPerSec=26.706373653006064, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:05:19,441] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:05:43,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=12550, skipped=241, lr=[6.248015148570156e-07, 6.248015148570156e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:05:43,668] [INFO] [timer.py:199:stop] epoch=13/micro_step=2360/global_step=12550, RunningAvgSamplesPerSec=23.837807097199537, CurrSamplesPerSec=23.75184196245705, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:06:10,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=12560, skipped=241, lr=[6.197430276780202e-07, 6.197430276780202e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:06:10,588] [INFO] [timer.py:199:stop] epoch=13/micro_step=2400/global_step=12560, RunningAvgSamplesPerSec=23.83778714688086, CurrSamplesPerSec=23.783840182363583, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:06:37,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=12570, skipped=241, lr=[6.14703695249552e-07, 6.14703695249552e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:06:37,464] [INFO] [timer.py:199:stop] epoch=13/micro_step=2440/global_step=12570, RunningAvgSamplesPerSec=23.83779799504922, CurrSamplesPerSec=23.820261354680945, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:07:04,131] [INFO] [logging.py:96:log_dist] [Rank 0] step=12580, skipped=241, lr=[6.096835405255384e-07, 6.096835405255384e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:07:04,372] [INFO] [timer.py:199:stop] epoch=13/micro_step=2480/global_step=12580, RunningAvgSamplesPerSec=23.83778766695814, CurrSamplesPerSec=23.792603409923167, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:07:31,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=12590, skipped=241, lr=[6.04682586372548e-07, 6.04682586372548e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:07:31,261] [INFO] [timer.py:199:stop] epoch=13/micro_step=2520/global_step=12590, RunningAvgSamplesPerSec=23.83779436179634, CurrSamplesPerSec=23.869615840644588, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:07:57,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=12600, skipped=241, lr=[5.997008555696994e-07, 5.997008555696994e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:07:58,153] [INFO] [timer.py:199:stop] epoch=13/micro_step=2560/global_step=12600, RunningAvgSamplesPerSec=23.837797665719158, CurrSamplesPerSec=23.815580323405513, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:08:24,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=12610, skipped=241, lr=[5.947383708085422e-07, 5.947383708085422e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:08:25,043] [INFO] [timer.py:199:stop] epoch=13/micro_step=2600/global_step=12610, RunningAvgSamplesPerSec=23.83779953105419, CurrSamplesPerSec=23.86226146668454, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:08:51,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=12620, skipped=241, lr=[5.89795154692966e-07, 5.89795154692966e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:08:51,939] [INFO] [timer.py:199:stop] epoch=13/micro_step=2640/global_step=12620, RunningAvgSamplesPerSec=23.837796696845814, CurrSamplesPerSec=23.830257233356534, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:09:18,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=12630, skipped=241, lr=[5.848712297390894e-07, 5.848712297390894e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:09:18,867] [INFO] [timer.py:199:stop] epoch=13/micro_step=2680/global_step=12630, RunningAvgSamplesPerSec=23.837773905323317, CurrSamplesPerSec=23.880050443946583, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:09:45,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=12640, skipped=241, lr=[5.799666183751652e-07, 5.799666183751652e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:09:45,761] [INFO] [timer.py:199:stop] epoch=13/micro_step=2720/global_step=12640, RunningAvgSamplesPerSec=23.83777130760209, CurrSamplesPerSec=23.872228943922345, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:09:50,841] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:09:53,239] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:10:11,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=12650, skipped=243, lr=[5.760568500844135e-07, 5.760568500844135e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:10:12,058] [INFO] [timer.py:199:stop] epoch=13/micro_step=2760/global_step=12650, RunningAvgSamplesPerSec=23.838189458269568, CurrSamplesPerSec=23.841108454455597, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:10:38,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=12660, skipped=243, lr=[5.71187059420716e-07, 5.71187059420716e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:10:38,937] [INFO] [timer.py:199:stop] epoch=13/micro_step=2800/global_step=12660, RunningAvgSamplesPerSec=23.8381995873853, CurrSamplesPerSec=23.855935579707538, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:11:05,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=12670, skipped=243, lr=[5.663366446777296e-07, 5.663366446777296e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:11:05,897] [INFO] [timer.py:199:stop] epoch=13/micro_step=2840/global_step=12670, RunningAvgSamplesPerSec=23.83815366321794, CurrSamplesPerSec=23.696712259451434, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:11:32,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=12680, skipped=243, lr=[5.615056279488694e-07, 5.615056279488694e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:11:33,008] [INFO] [timer.py:199:stop] epoch=13/micro_step=2880/global_step=12680, RunningAvgSamplesPerSec=23.838006614389197, CurrSamplesPerSec=23.70926808363532, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:11:59,752] [INFO] [logging.py:96:log_dist] [Rank 0] step=12690, skipped=243, lr=[5.566940312391926e-07, 5.566940312391926e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:11:59,993] [INFO] [timer.py:199:stop] epoch=13/micro_step=2920/global_step=12690, RunningAvgSamplesPerSec=23.837941678586173, CurrSamplesPerSec=23.72961500602665, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:12:26,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=12700, skipped=243, lr=[5.519018764653001e-07, 5.519018764653001e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:12:26,975] [INFO] [timer.py:199:stop] epoch=13/micro_step=2960/global_step=12700, RunningAvgSamplesPerSec=23.83788486930343, CurrSamplesPerSec=23.599260818269254, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:12:53,635] [INFO] [logging.py:96:log_dist] [Rank 0] step=12710, skipped=243, lr=[5.471291854552315e-07, 5.471291854552315e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:12:53,877] [INFO] [timer.py:199:stop] epoch=13/micro_step=3000/global_step=12710, RunningAvgSamplesPerSec=23.83787884532931, CurrSamplesPerSec=23.792841711461442, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:13:20,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=12720, skipped=243, lr=[5.423759799483769e-07, 5.423759799483769e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:13:20,795] [INFO] [timer.py:199:stop] epoch=13/micro_step=3040/global_step=12720, RunningAvgSamplesPerSec=23.837859881085638, CurrSamplesPerSec=23.819534247637726, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:13:47,465] [INFO] [logging.py:96:log_dist] [Rank 0] step=12730, skipped=243, lr=[5.376422815953646e-07, 5.376422815953646e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:13:47,708] [INFO] [timer.py:199:stop] epoch=13/micro_step=3080/global_step=12730, RunningAvgSamplesPerSec=23.83784691358864, CurrSamplesPerSec=23.788380149074246, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:14:14,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=12740, skipped=243, lr=[5.329281119579718e-07, 5.329281119579718e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:14:14,633] [INFO] [timer.py:199:stop] epoch=13/micro_step=3120/global_step=12740, RunningAvgSamplesPerSec=23.837824625030976, CurrSamplesPerSec=23.86589989364007, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:14:25,112] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:14:27,513] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:14:40,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=12750, skipped=245, lr=[5.291708513560332e-07, 5.291708513560332e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:14:40,967] [INFO] [timer.py:199:stop] epoch=13/micro_step=3160/global_step=12750, RunningAvgSamplesPerSec=23.838214293414588, CurrSamplesPerSec=23.832565491611806, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:15:07,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=12760, skipped=245, lr=[5.244918874584335e-07, 5.244918874584335e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:15:07,882] [INFO] [timer.py:199:stop] epoch=13/micro_step=3200/global_step=12760, RunningAvgSamplesPerSec=23.838197891875335, CurrSamplesPerSec=23.817583537302866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:15:34,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=12770, skipped=245, lr=[5.198325121758892e-07, 5.198325121758892e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:15:34,772] [INFO] [timer.py:199:stop] epoch=13/micro_step=3240/global_step=12770, RunningAvgSamplesPerSec=23.838199157804485, CurrSamplesPerSec=23.806149884261878, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:16:01,455] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=12780, skipped=245, lr=[5.151927467316391e-07, 5.151927467316391e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:16:01,698] [INFO] [timer.py:199:stop] epoch=13/micro_step=3280/global_step=12780, RunningAvgSamplesPerSec=23.838172567005937, CurrSamplesPerSec=23.76481816029714, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:16:28,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=12790, skipped=245, lr=[5.105726122595984e-07, 5.105726122595984e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:16:28,602] [INFO] [timer.py:199:stop] epoch=13/micro_step=3320/global_step=12790, RunningAvgSamplesPerSec=23.838162490234662, CurrSamplesPerSec=23.836298580355855, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:16:55,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=12800, skipped=245, lr=[5.059721298042654e-07, 5.059721298042654e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:16:55,530] [INFO] [timer.py:199:stop] epoch=13/micro_step=3360/global_step=12800, RunningAvgSamplesPerSec=23.83813690296833, CurrSamplesPerSec=23.8607724713633, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:17:22,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=12810, skipped=245, lr=[5.013913203206249e-07, 5.013913203206249e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:17:22,451] [INFO] [timer.py:199:stop] epoch=13/micro_step=3400/global_step=12810, RunningAvgSamplesPerSec=23.83811450034546, CurrSamplesPerSec=23.88506926555995, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:17:49,148] [INFO] [logging.py:96:log_dist] [Rank 0] step=12820, skipped=245, lr=[4.968302046740528e-07, 4.968302046740528e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:17:49,390] [INFO] [timer.py:199:stop] epoch=13/micro_step=3440/global_step=12820, RunningAvgSamplesPerSec=23.838080296638996, CurrSamplesPerSec=23.759227257970238, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:18:16,109] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=12830, skipped=245, lr=[4.922888036402167e-07, 4.922888036402167e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:18:16,350] [INFO] [timer.py:199:stop] epoch=13/micro_step=3480/global_step=12830, RunningAvgSamplesPerSec=23.83803199405806, CurrSamplesPerSec=23.86672107987094, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:18:43,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=12840, skipped=245, lr=[4.877671379049906e-07, 4.877671379049906e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:18:43,430] [INFO] [timer.py:199:stop] epoch=13/micro_step=3520/global_step=12840, RunningAvgSamplesPerSec=23.837902977620455, CurrSamplesPerSec=23.715143476219698, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:18:59,364] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:19:01,790] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072
[2023-04-24 01:19:09,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=12850, skipped=247, lr=[4.841640285753278e-07, 4.841640285753278e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 01:19:09,902] [INFO] [timer.py:199:stop] epoch=13/micro_step=3560/global_step=12850, RunningAvgSamplesPerSec=23.83819283196187, CurrSamplesPerSec=23.702784444409126, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 01:19:36,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=12860, skipped=247, lr=[4.796779382189927e-07, 4.796779382189927e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 01:19:36,912] [INFO] [timer.py:199:stop] epoch=13/micro_step=3600/global_step=12860, RunningAvgSamplesPerSec=23.83811267879995, CurrSamplesPerSec=23.77808656861716, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 01:20:03,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=12870, skipped=247, lr=[4.7521164060317327e-07, 4.7521164060317327e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 01:20:03,890] [INFO] [timer.py:199:stop] epoch=13/micro_step=3640/global_step=12870, RunningAvgSamplesPerSec=23.838051833679717, CurrSamplesPerSec=23.740397839696893, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 01:20:30,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=12880, skipped=247, lr=[4.707651560716487e-07, 4.707651560716487e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 01:20:30,803] [INFO] [timer.py:199:stop] epoch=13/micro_step=3680/global_step=12880, RunningAvgSamplesPerSec=23.83803656645882, CurrSamplesPerSec=23.79685559530335, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 14/16 *****
ppl: 1.7810802459716797
saving the final model ...
Beginning of Epoch 15/16, Total Micro Batches 3680 [2023-04-24 01:21:49,038] [INFO] [logging.py:96:log_dist] [Rank 0] step=12890, skipped=247, lr=[4.6633850487794944e-07, 4.6633850487794944e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:21:49,279] [INFO] [timer.py:199:stop] epoch=14/micro_step=40/global_step=12890, RunningAvgSamplesPerSec=23.837991863438695, CurrSamplesPerSec=23.89251851737132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:22:15,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=12900, skipped=247, lr=[4.6193170718526755e-07, 4.6193170718526755e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:22:16,170] [INFO] [timer.py:199:stop] epoch=14/micro_step=80/global_step=12900, RunningAvgSamplesPerSec=23.83799292295908, CurrSamplesPerSec=23.858274261388868, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:22:42,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=12910, skipped=247, lr=[4.5754478306636005e-07, 4.5754478306636005e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:22:43,075] [INFO] [timer.py:199:stop] epoch=14/micro_step=120/global_step=12910, RunningAvgSamplesPerSec=23.83798290699048, CurrSamplesPerSec=23.767256853189902, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:23:09,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=12920, skipped=247, lr=[4.5317775250346414e-07, 4.5317775250346414e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:23:09,986] [INFO] [timer.py:199:stop] epoch=14/micro_step=160/global_step=12920, RunningAvgSamplesPerSec=23.8379696848236, CurrSamplesPerSec=23.80729634501617, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:23:36,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=12930, skipped=247, lr=[4.488306353882012e-07, 4.488306353882012e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:23:36,891] [INFO] [timer.py:199:stop] epoch=14/micro_step=200/global_step=12930, RunningAvgSamplesPerSec=23.83796065959537, CurrSamplesPerSec=23.86335818178828, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:24:03,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=12940, skipped=247, lr=[4.4450345152149004e-07, 4.4450345152149004e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:24:03,786] [INFO] [timer.py:199:stop] epoch=14/micro_step=240/global_step=12940, RunningAvgSamplesPerSec=23.837961741591435, CurrSamplesPerSec=23.81935247781273, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:24:25,028] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:24:27,429] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:24:29,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=12950, skipped=249, lr=[4.4105606961533046e-07, 4.4105606961533046e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:24:30,124] [INFO] [timer.py:199:stop] epoch=14/micro_step=280/global_step=12950, RunningAvgSamplesPerSec=23.838342730688858, CurrSamplesPerSec=23.791142072143934, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:24:56,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=12960, skipped=249, lr=[4.367648152044436e-07, 4.367648152044436e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:24:57,047] [INFO] [timer.py:199:stop] epoch=14/micro_step=320/global_step=12960, RunningAvgSamplesPerSec=23.838320600198784, CurrSamplesPerSec=23.83682774083982, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:25:23,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=12970, skipped=249, lr=[4.324935490013594e-07, 4.324935490013594e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:25:23,978] [INFO] [timer.py:199:stop] epoch=14/micro_step=360/global_step=12970, RunningAvgSamplesPerSec=23.838294141344758, CurrSamplesPerSec=23.75549724644312, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:25:50,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=12980, skipped=249, lr=[4.282422904614955e-07, 4.282422904614955e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:25:50,906] [INFO] [timer.py:199:stop] epoch=14/micro_step=400/global_step=12980, RunningAvgSamplesPerSec=23.838268895816096, CurrSamplesPerSec=23.812248719262726, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:26:17,641] [INFO] [logging.py:96:log_dist] [Rank 0] step=12990, skipped=249, lr=[4.2401105894913803e-07, 4.2401105894913803e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:26:17,884] [INFO] [timer.py:199:stop] epoch=14/micro_step=440/global_step=12990, RunningAvgSamplesPerSec=23.83821046089337, CurrSamplesPerSec=23.76040706008668, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:26:44,621] [INFO] [logging.py:96:log_dist] [Rank 0] step=13000, skipped=249, lr=[4.1979987373735076e-07, 4.1979987373735076e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:26:44,863] [INFO] [timer.py:199:stop] epoch=14/micro_step=480/global_step=13000, RunningAvgSamplesPerSec=23.83815321192848, CurrSamplesPerSec=23.73426649848568, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:27:11,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=13010, skipped=249, lr=[4.156087540078871e-07, 4.156087540078871e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:27:11,851] [INFO] [timer.py:199:stop] epoch=14/micro_step=520/global_step=13010, RunningAvgSamplesPerSec=23.838087817298998, CurrSamplesPerSec=23.795478108924218, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:27:38,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=13020, skipped=249, lr=[4.114377188511033e-07, 4.114377188511033e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:27:38,796] [INFO] [timer.py:199:stop] epoch=14/micro_step=560/global_step=13020, RunningAvgSamplesPerSec=23.838051917690873, CurrSamplesPerSec=23.83024877127084, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
01:28:05,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=13030, skipped=249, lr=[4.0728678726586917e-07, 4.0728678726586917e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:28:05,688] [INFO] [timer.py:199:stop] epoch=14/micro_step=600/global_step=13030, RunningAvgSamplesPerSec=23.83805368619582, CurrSamplesPerSec=23.78427007706716, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:28:32,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=13040, skipped=249, lr=[4.031559781594886e-07, 4.031559781594886e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:28:32,616] [INFO] [timer.py:199:stop] epoch=14/micro_step=640/global_step=13040, RunningAvgSamplesPerSec=23.83802820077944, CurrSamplesPerSec=23.843944056207427, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:28:59,272] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:28:59,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=13050, skipped=250, lr=[3.9945547023619014e-07, 3.9945547023619014e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:28:59,273] [INFO] [timer.py:199:stop] epoch=14/micro_step=680/global_step=13050, RunningAvgSamplesPerSec=23.838187587293895, CurrSamplesPerSec=26.64446085295282, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:29:01,672] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:29:25,665] [INFO] [logging.py:96:log_dist] [Rank 0] step=13060, skipped=251, lr=[3.957712904156798e-07, 3.957712904156798e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:29:25,906] [INFO] [timer.py:199:stop] epoch=14/micro_step=720/global_step=13060, RunningAvgSamplesPerSec=23.8383628474932, CurrSamplesPerSec=23.803749641749455, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:29:52,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=13070, skipped=251, lr=[3.916969240564129e-07, 3.916969240564129e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:29:52,831] [INFO] [timer.py:199:stop] epoch=14/micro_step=760/global_step=13070, RunningAvgSamplesPerSec=23.838339853352124, CurrSamplesPerSec=23.8589422374332, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:30:19,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=13080, skipped=251, lr=[3.87642751187103e-07, 3.87642751187103e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:30:19,747] [INFO] [timer.py:199:stop] epoch=14/micro_step=800/global_step=13080, RunningAvgSamplesPerSec=23.838322008907927, CurrSamplesPerSec=23.85629811954728, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:30:46,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=13090, skipped=251, lr=[3.8360879027431796e-07, 3.8360879027431796e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:30:46,645] [INFO] [timer.py:199:stop] epoch=14/micro_step=840/global_step=13090, RunningAvgSamplesPerSec=23.838316259259873, CurrSamplesPerSec=23.81795125240677, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:31:13,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=13100, skipped=251, lr=[3.7959505969256485e-07, 3.7959505969256485e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:31:13,549] [INFO] [timer.py:199:stop] epoch=14/micro_step=880/global_step=13100, RunningAvgSamplesPerSec=23.838307817332606, CurrSamplesPerSec=23.827657540240033, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:31:40,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=13110, skipped=251, lr=[3.7560157772419825e-07, 3.7560157772419825e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:31:40,427] [INFO] [timer.py:199:stop] epoch=14/micro_step=920/global_step=13110, RunningAvgSamplesPerSec=23.838314893493216, CurrSamplesPerSec=23.869579757926495, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:32:07,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=13120, skipped=251, lr=[3.7162836255934375e-07, 3.7162836255934375e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:32:07,349] [INFO] [timer.py:199:stop] epoch=14/micro_step=960/global_step=13120, RunningAvgSamplesPerSec=23.838293704712118, CurrSamplesPerSec=23.803800301498626, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:32:34,016] [INFO] [logging.py:96:log_dist] [Rank 0] step=13130, skipped=251, lr=[3.6767543229581235e-07, 3.6767543229581235e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:32:34,260] [INFO] [timer.py:199:stop] epoch=14/micro_step=1000/global_step=13130, RunningAvgSamplesPerSec=23.838279144193724, CurrSamplesPerSec=23.756863800171338, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:33:00,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=13140, skipped=251, lr=[3.637428049390193e-07, 3.637428049390193e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:33:01,196] [INFO] [timer.py:199:stop] epoch=14/micro_step=1040/global_step=13140, RunningAvgSamplesPerSec=23.838248702201636, CurrSamplesPerSec=23.763762039660058, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:33:27,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=13150, skipped=251, lr=[3.598304984018975e-07, 3.598304984018975e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:33:28,137] [INFO] [timer.py:199:stop] epoch=14/micro_step=1080/global_step=13150, RunningAvgSamplesPerSec=23.83821689583995, CurrSamplesPerSec=23.73310397899597, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:33:33,234] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:33:35,636] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:33:54,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=13160, skipped=253, lr=[3.5671529614076906e-07, 3.5671529614076906e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:33:54,537] [INFO] [timer.py:199:stop] epoch=14/micro_step=1120/global_step=13160, RunningAvgSamplesPerSec=23.838546798474745, CurrSamplesPerSec=23.684968005252358, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:34:21,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=13170, skipped=253, lr=[3.528396119241522e-07, 3.528396119241522e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:34:21,547] [INFO] [timer.py:199:stop] epoch=14/micro_step=1160/global_step=13170, RunningAvgSamplesPerSec=23.838464613810174, CurrSamplesPerSec=23.733196304900524, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:34:48,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=13180, skipped=253, lr=[3.4898429819074517e-07, 3.4898429819074517e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:34:48,586] [INFO] [timer.py:199:stop] epoch=14/micro_step=1200/global_step=13180, RunningAvgSamplesPerSec=23.838363787716375, CurrSamplesPerSec=23.683158372553752, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:35:15,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=13190, skipped=253, lr=[3.451493725013248e-07, 3.451493725013248e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:35:15,611] [INFO] [timer.py:199:stop] epoch=14/micro_step=1240/global_step=13190, RunningAvgSamplesPerSec=23.838281799173167, CurrSamplesPerSec=23.824257008123585, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 01:35:42,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=13200, skipped=253, lr=[3.413348523238008e-07, 3.413348523238008e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:35:42,583] [INFO] [timer.py:199:stop] epoch=14/micro_step=1280/global_step=13200, RunningAvgSamplesPerSec=23.83823636556419, CurrSamplesPerSec=23.794585886621718, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:36:09,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=13210, skipped=253, lr=[3.375407550331329e-07, 3.375407550331329e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:36:09,528] [INFO] [timer.py:199:stop] epoch=14/micro_step=1320/global_step=13210, RunningAvgSamplesPerSec=23.838204683410957, CurrSamplesPerSec=23.76402711290866, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:36:36,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=13220, skipped=253, lr=[3.3376709791126166e-07, 3.3376709791126166e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:36:36,467] [INFO] [timer.py:199:stop] epoch=14/micro_step=1360/global_step=13220, RunningAvgSamplesPerSec=23.83817682429208, CurrSamplesPerSec=23.799530863621786, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:37:03,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=13230, skipped=253, lr=[3.30013898147018e-07, 3.30013898147018e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:37:03,380] [INFO] [timer.py:199:stop] epoch=14/micro_step=1400/global_step=13230, RunningAvgSamplesPerSec=23.838167565502783, CurrSamplesPerSec=23.793499701556886, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:37:30,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=13240, skipped=253, lr=[3.2628117283605354e-07, 3.2628117283605354e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:37:30,308] [INFO] [timer.py:199:stop] epoch=14/micro_step=1440/global_step=13240, RunningAvgSamplesPerSec=23.838150038669497, CurrSamplesPerSec=23.762190655416113, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 01:37:56,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=13250, skipped=253, lr=[3.225689389807591e-07, 3.225689389807591e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:37:57,235] [INFO] [timer.py:199:stop] epoch=14/micro_step=1480/global_step=13250, RunningAvgSamplesPerSec=23.838132456827744, CurrSamplesPerSec=23.85780351978772, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:38:07,717] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:38:10,121] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:38:23,376] [INFO] [logging.py:96:log_dist] [Rank 0] step=13260, skipped=255, lr=[3.1961391711062754e-07, 3.1961391711062754e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:38:23,619] [INFO] [timer.py:199:stop] epoch=14/micro_step=1520/global_step=13260, RunningAvgSamplesPerSec=23.838474841228983, CurrSamplesPerSec=23.75480982423674, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:38:50,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=13270, skipped=255, lr=[3.159386104235807e-07, 3.159386104235807e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:38:50,686] [INFO] [timer.py:199:stop] epoch=14/micro_step=1560/global_step=13270, RunningAvgSamplesPerSec=23.83836634978057, CurrSamplesPerSec=23.68486560525897, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:39:17,431] [INFO] [logging.py:96:log_dist] [Rank 0] step=13280, skipped=255, lr=[3.1228384230209573e-07, 3.1228384230209573e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:39:17,673] [INFO] [timer.py:199:stop] epoch=14/micro_step=1600/global_step=13280, RunningAvgSamplesPerSec=23.838305705222275, CurrSamplesPerSec=23.774742283152552, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:39:44,375] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=13290, skipped=255, lr=[3.0864962939347304e-07, 3.0864962939347304e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:39:44,616] [INFO] [timer.py:199:stop] epoch=14/micro_step=1640/global_step=13290, RunningAvgSamplesPerSec=23.838274634762595, CurrSamplesPerSec=23.79466603624449, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:40:11,300] [INFO] [logging.py:96:log_dist] [Rank 0] step=13300, skipped=255, lr=[3.0503598825138243e-07, 3.0503598825138243e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:40:11,543] [INFO] [timer.py:199:stop] epoch=14/micro_step=1680/global_step=13300, RunningAvgSamplesPerSec=23.83825505195735, CurrSamplesPerSec=23.791411973830776, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:40:38,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=13310, skipped=255, lr=[3.014429353357927e-07, 3.014429353357927e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:40:38,612] [INFO] [timer.py:199:stop] epoch=14/micro_step=1720/global_step=13310, RunningAvgSamplesPerSec=23.83814003488079, CurrSamplesPerSec=23.514771089762178, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:41:05,462] [INFO] [logging.py:96:log_dist] [Rank 0] step=13320, skipped=255, lr=[2.9787048701289346e-07, 2.9787048701289346e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:41:05,708] [INFO] [timer.py:199:stop] epoch=14/micro_step=1760/global_step=13320, RunningAvgSamplesPerSec=23.838004049978135, CurrSamplesPerSec=23.72452290532576, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:41:32,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=13330, skipped=255, lr=[2.943186595550194e-07, 2.943186595550194e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:41:32,781] [INFO] [timer.py:199:stop] epoch=14/micro_step=1800/global_step=13330, RunningAvgSamplesPerSec=23.837888943414434, CurrSamplesPerSec=23.606453947024757, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:41:59,724] [INFO] 
[logging.py:96:log_dist] [Rank 0] step=13340, skipped=255, lr=[2.9078746914058296e-07, 2.9078746914058296e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:41:59,970] [INFO] [timer.py:199:stop] epoch=14/micro_step=1840/global_step=13340, RunningAvgSamplesPerSec=23.837695853592596, CurrSamplesPerSec=23.569099968953307, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:42:26,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=13350, skipped=255, lr=[2.872769318539902e-07, 2.872769318539902e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:42:27,182] [INFO] [timer.py:199:stop] epoch=14/micro_step=1880/global_step=13350, RunningAvgSamplesPerSec=23.8374841717762, CurrSamplesPerSec=23.556985676278327, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:42:43,207] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:42:45,646] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:42:53,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=13360, skipped=257, lr=[2.8448338302543117e-07, 2.8448338302543117e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:42:53,796] [INFO] [timer.py:199:stop] epoch=14/micro_step=1920/global_step=13360, RunningAvgSamplesPerSec=23.837670151795734, CurrSamplesPerSec=23.636633527799095, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:43:20,751] [INFO] [logging.py:96:log_dist] [Rank 0] step=13370, skipped=257, lr=[2.8101006160133776e-07, 2.8101006160133776e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:43:20,996] [INFO] [timer.py:199:stop] epoch=14/micro_step=1960/global_step=13370, RunningAvgSamplesPerSec=23.83746780197354, CurrSamplesPerSec=23.58493121955681, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:43:48,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=13380, skipped=257, lr=[2.7755743784072665e-07, 2.7755743784072665e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:43:48,902] [INFO] [timer.py:199:stop] epoch=14/micro_step=2000/global_step=13380, RunningAvgSamplesPerSec=23.83689059834418, CurrSamplesPerSec=23.509902544882305, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:44:15,817] [INFO] [logging.py:96:log_dist] [Rank 0] step=13390, skipped=257, lr=[2.74125527470139e-07, 2.74125527470139e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:44:16,062] [INFO] [timer.py:199:stop] epoch=14/micro_step=2040/global_step=13390, RunningAvgSamplesPerSec=23.836716816295038, CurrSamplesPerSec=23.515570349836384, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:44:42,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=13400, skipped=257, lr=[2.707143461217687e-07, 2.707143461217687e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:44:43,221] [INFO] [timer.py:199:stop] epoch=14/micro_step=2080/global_step=13400, RunningAvgSamplesPerSec=23.836541535957384, CurrSamplesPerSec=23.493036818720814, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:45:10,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=13410, skipped=257, lr=[2.6732390933338675e-07, 2.6732390933338675e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:45:10,409] [INFO] [timer.py:199:stop] epoch=14/micro_step=2120/global_step=13410, RunningAvgSamplesPerSec=23.836350048548272, CurrSamplesPerSec=23.57571151215754, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:45:38,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=13420, skipped=257, lr=[2.6395423254827646e-07, 2.6395423254827646e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:45:38,305] [INFO] [timer.py:199:stop] epoch=14/micro_step=2160/global_step=13420, RunningAvgSamplesPerSec=23.83599168845072, CurrSamplesPerSec=23.59389684359928, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:46:05,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=13430, skipped=257, lr=[2.6060533111515885e-07, 2.6060533111515885e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:46:05,470] [INFO] [timer.py:199:stop] epoch=14/micro_step=2200/global_step=13430, RunningAvgSamplesPerSec=23.83581479675128, CurrSamplesPerSec=23.575140048925128, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:46:32,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=13440, skipped=257, lr=[2.572772202881254e-07, 2.572772202881254e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:46:32,626] [INFO] [timer.py:199:stop] epoch=14/micro_step=2240/global_step=13440, RunningAvgSamplesPerSec=23.835650082317006, CurrSamplesPerSec=23.652768241946784, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:46:59,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=13450, skipped=257, lr=[2.5396991522656607e-07, 2.5396991522656607e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:46:59,758] [INFO] [timer.py:199:stop] epoch=14/micro_step=2280/global_step=13450, RunningAvgSamplesPerSec=23.835498551788575, CurrSamplesPerSec=23.661149399188435, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:47:21,228] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:47:23,655] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:47:26,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=13460, skipped=259, lr=[2.5133906145507185e-07, 2.5133906145507185e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:47:26,382] [INFO] [timer.py:199:stop] epoch=14/micro_step=2320/global_step=13460, RunningAvgSamplesPerSec=23.835681438893943, CurrSamplesPerSec=23.51959219399874, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:47:53,280] [INFO] [logging.py:96:log_dist] [Rank 0] step=13470, skipped=259, lr=[2.480692446704834e-07, 2.480692446704834e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:47:53,526] [INFO] [timer.py:199:stop] epoch=14/micro_step=2360/global_step=13470, RunningAvgSamplesPerSec=23.835524307676714, CurrSamplesPerSec=23.577821610962346, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:48:20,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=13480, skipped=259, lr=[2.4482027559327107e-07, 2.4482027559327107e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:48:20,685] [INFO] [timer.py:199:stop] epoch=14/micro_step=2400/global_step=13480, RunningAvgSamplesPerSec=23.835352084783537, CurrSamplesPerSec=23.54529067356863, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:48:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=13490, skipped=259, lr=[2.4159216902233913e-07, 2.4159216902233913e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:48:49,197] [INFO] [timer.py:199:stop] epoch=14/micro_step=2440/global_step=13490, RunningAvgSamplesPerSec=23.834409370170366, CurrSamplesPerSec=23.523778268074132, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 01:49:16,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=13500, skipped=259, lr=[2.3838493966156187e-07, 2.3838493966156187e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:49:16,370] [INFO] [timer.py:199:stop] epoch=14/micro_step=2480/global_step=13500, RunningAvgSamplesPerSec=23.834231057034238, CurrSamplesPerSec=23.60109707746406, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:49:44,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=13510, skipped=259, lr=[2.3519860211972186e-07, 2.3519860211972186e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:49:44,900] [INFO] [timer.py:199:stop] epoch=14/micro_step=2520/global_step=13510, RunningAvgSamplesPerSec=23.833243771453315, CurrSamplesPerSec=23.558709915002378, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:50:11,830] [INFO] [logging.py:96:log_dist] [Rank 0] step=13520, skipped=259, lr=[2.320331709104382e-07, 2.320331709104382e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:50:12,075] [INFO] [timer.py:199:stop] epoch=14/micro_step=2560/global_step=13520, RunningAvgSamplesPerSec=23.83306325521595, CurrSamplesPerSec=23.6286607373369, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:50:39,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=13530, skipped=259, lr=[2.288886604521028e-07, 2.288886604521028e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:50:39,349] [INFO] [timer.py:199:stop] epoch=14/micro_step=2600/global_step=13530, RunningAvgSamplesPerSec=23.832838895751607, CurrSamplesPerSec=23.611765479936345, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:51:06,323] [INFO] [logging.py:96:log_dist] [Rank 0] step=13540, skipped=259, lr=[2.2576508506781592e-07, 2.2576508506781592e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:51:06,567] [INFO] [timer.py:199:stop] epoch=14/micro_step=2640/global_step=13540, RunningAvgSamplesPerSec=23.83263186669935, CurrSamplesPerSec=23.64961746245987, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 01:51:33,538] [INFO] [logging.py:96:log_dist] [Rank 0] step=13550, skipped=259, lr=[2.2266245898531847e-07, 2.2266245898531847e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:51:33,782] [INFO] [timer.py:199:stop] epoch=14/micro_step=2680/global_step=13550, RunningAvgSamplesPerSec=23.832434166330273, CurrSamplesPerSec=23.64214813566123, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:52:00,670] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:52:00,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=13560, skipped=260, lr=[2.198880188464332e-07, 2.198880188464332e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:52:00,671] [INFO] [timer.py:199:stop] epoch=14/micro_step=2720/global_step=13560, RunningAvgSamplesPerSec=23.832442005347495, CurrSamplesPerSec=26.43226484545765, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:52:03,089] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:52:29,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=13570, skipped=261, lr=[2.1713056932673222e-07, 2.1713056932673222e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:52:29,608] [INFO] [timer.py:199:stop] epoch=14/micro_step=2760/global_step=13570, RunningAvgSamplesPerSec=23.83115372007036, CurrSamplesPerSec=13.563148961356143, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:52:56,555] [INFO] [logging.py:96:log_dist] [Rank 0] step=13580, skipped=261, lr=[2.1408667616833898e-07, 2.1408667616833898e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:52:56,799] [INFO] [timer.py:199:stop] epoch=14/micro_step=2800/global_step=13580, RunningAvgSamplesPerSec=23.830971770448038, CurrSamplesPerSec=23.635118450100254, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:53:23,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=13590, skipped=261, lr=[2.1106378550634563e-07, 2.1106378550634563e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:53:23,965] [INFO] [timer.py:199:stop] epoch=14/micro_step=2840/global_step=13590, RunningAvgSamplesPerSec=23.830800591802724, CurrSamplesPerSec=23.616470611627147, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:53:51,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=13600, skipped=261, lr=[2.0806191110987962e-07, 2.0806191110987962e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:53:51,360] [INFO] [timer.py:199:stop] epoch=14/micro_step=2880/global_step=13600, RunningAvgSamplesPerSec=23.83048148081731, CurrSamplesPerSec=23.60800272423585, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:54:18,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=13610, skipped=261, lr=[2.050810666523392e-07, 2.050810666523392e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:54:18,539] [INFO] [timer.py:199:stop] epoch=14/micro_step=2920/global_step=13610, RunningAvgSamplesPerSec=23.830300949098117, CurrSamplesPerSec=23.547393264528, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:54:45,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=13620, skipped=261, lr=[2.021212657113329e-07, 2.021212657113329e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:54:45,718] [INFO] [timer.py:199:stop] epoch=14/micro_step=2960/global_step=13620, RunningAvgSamplesPerSec=23.83012341156267, CurrSamplesPerSec=23.597074278947034, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:55:12,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=13630, skipped=261, lr=[1.9918252176861636e-07, 1.9918252176861636e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:55:12,894] [INFO] [timer.py:199:stop] epoch=14/micro_step=3000/global_step=13630, RunningAvgSamplesPerSec=23.82994412047407, CurrSamplesPerSec=23.5293290197253, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:55:39,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=13640, skipped=261, lr=[1.9626484821003288e-07, 1.9626484821003288e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:55:40,115] [INFO] [timer.py:199:stop] epoch=14/micro_step=3040/global_step=13640, RunningAvgSamplesPerSec=23.82974519197351, CurrSamplesPerSec=23.50855396037475, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:56:07,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=13650, skipped=261, lr=[1.9336825832544853e-07, 1.9336825832544853e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:56:07,301] [INFO] [timer.py:199:stop] epoch=14/micro_step=3080/global_step=13650, RunningAvgSamplesPerSec=23.829560967118905, CurrSamplesPerSec=23.586584942271024, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:56:34,243] [INFO] [logging.py:96:log_dist] [Rank 0] step=13660, skipped=261, lr=[1.9049276530869815e-07, 1.9049276530869815e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:56:34,486] [INFO] [timer.py:199:stop] epoch=14/micro_step=3120/global_step=13660, RunningAvgSamplesPerSec=23.82938248062973, CurrSamplesPerSec=23.630545259149443, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:56:39,620] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 01:56:42,049] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 01:57:00,841] [INFO] [logging.py:96:log_dist] [Rank 0] step=13670, skipped=263, lr=[1.8820756944504756e-07, 1.8820756944504756e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:57:01,087] [INFO] [timer.py:199:stop] epoch=14/micro_step=3160/global_step=13670, RunningAvgSamplesPerSec=23.82958210636512, CurrSamplesPerSec=23.454282204596556, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:57:28,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=13680, skipped=263, lr=[1.853700837320792e-07, 1.853700837320792e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:57:28,724] [INFO] [timer.py:199:stop] epoch=14/micro_step=3200/global_step=13680, RunningAvgSamplesPerSec=23.829132667694072, CurrSamplesPerSec=20.50831722451242, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:57:55,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=13690, skipped=263, lr=[1.8255373131825946e-07, 1.8255373131825946e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:57:55,918] [INFO] [timer.py:199:stop] epoch=14/micro_step=3240/global_step=13690, RunningAvgSamplesPerSec=23.828947088074038, CurrSamplesPerSec=23.552283550288713, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:58:22,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=13700, skipped=263, lr=[1.797585250319408e-07, 1.797585250319408e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:58:23,142] [INFO] [timer.py:199:stop] epoch=14/micro_step=3280/global_step=13700, RunningAvgSamplesPerSec=23.828741290051386, CurrSamplesPerSec=23.522994940507566, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 01:58:50,707] [INFO] [logging.py:96:log_dist] [Rank 0] step=13710, skipped=263, lr=[1.7698447760515951e-07, 1.7698447760515951e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:58:50,952] [INFO] [timer.py:199:stop] epoch=14/micro_step=3320/global_step=13710, RunningAvgSamplesPerSec=23.82830520381882, CurrSamplesPerSec=23.569331744925712, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:59:17,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=13720, skipped=263, lr=[1.7423160167357283e-07, 1.7423160167357283e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:59:18,155] [INFO] [timer.py:199:stop] epoch=14/micro_step=3360/global_step=13720, RunningAvgSamplesPerSec=23.828113453380386, CurrSamplesPerSec=23.59124687000268, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 01:59:45,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=13730, skipped=263, lr=[1.7149990977640285e-07, 1.7149990977640285e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 01:59:45,341] [INFO] [timer.py:199:stop] epoch=14/micro_step=3400/global_step=13730, RunningAvgSamplesPerSec=23.82793654416165, CurrSamplesPerSec=23.639093861665984, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:00:12,278] [INFO] [logging.py:96:log_dist] [Rank 0] step=13740, skipped=263, lr=[1.6878941435637864e-07, 1.6878941435637864e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:00:12,523] [INFO] [timer.py:199:stop] epoch=14/micro_step=3440/global_step=13740, RunningAvgSamplesPerSec=23.827757442864137, CurrSamplesPerSec=23.590150145885456, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:00:39,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=13750, skipped=263, lr=[1.6610012775968262e-07, 1.6610012775968262e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:00:39,680] [INFO] [timer.py:199:stop] epoch=14/micro_step=3480/global_step=13750, RunningAvgSamplesPerSec=23.827594371817277, CurrSamplesPerSec=23.641496407698185, MemAllocated=11.44GB, 
MaxMemAllocated=31.66GB [2023-04-24 02:01:06,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=13760, skipped=263, lr=[1.6343206223589012e-07, 1.6343206223589012e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:01:06,838] [INFO] [timer.py:199:stop] epoch=14/micro_step=3520/global_step=13760, RunningAvgSamplesPerSec=23.827430526203067, CurrSamplesPerSec=23.621919689217428, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:01:17,423] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:01:19,844] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:01:33,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=13770, skipped=265, lr=[1.613128971593536e-07, 1.613128971593536e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:01:33,426] [INFO] [timer.py:199:stop] epoch=14/micro_step=3560/global_step=13770, RunningAvgSamplesPerSec=23.82763470150914, CurrSamplesPerSec=23.652259726015423, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:02:00,324] [INFO] [logging.py:96:log_dist] [Rank 0] step=13780, skipped=265, lr=[1.586830601271403e-07, 1.586830601271403e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:02:00,569] [INFO] [timer.py:199:stop] epoch=14/micro_step=3600/global_step=13780, RunningAvgSamplesPerSec=23.827481429202564, CurrSamplesPerSec=23.617758877463512, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:02:27,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=13790, skipped=265, lr=[1.5607447795222968e-07, 1.5607447795222968e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:02:27,697] [INFO] [timer.py:199:stop] epoch=14/micro_step=3640/global_step=13790, RunningAvgSamplesPerSec=23.82733589245733, CurrSamplesPerSec=23.62095521472387, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:02:54,601] 
[INFO] [logging.py:96:log_dist] [Rank 0] step=13800, skipped=265, lr=[1.5348716251659185e-07, 1.5348716251659185e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 02:02:54,846] [INFO] [timer.py:199:stop] epoch=14/micro_step=3680/global_step=13800, RunningAvgSamplesPerSec=23.82717803860556, CurrSamplesPerSec=23.55247573226548, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 15/16 *****
ppl: 1.7765765190124512
saving the final model ...
Beginning of Epoch 16/16, Total Micro Batches 3680
[2023-04-24 02:04:13,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=13810, skipped=265, lr=[1.5092112560532933e-07, 1.5092112560532933e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 02:04:13,280] [INFO] [timer.py:199:stop] epoch=15/micro_step=40/global_step=13810, RunningAvgSamplesPerSec=23.82696085048047, CurrSamplesPerSec=23.61228887209389, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 02:04:40,209] [INFO] [logging.py:96:log_dist] [Rank 0] step=13820, skipped=265, lr=[1.4837637890662103e-07, 1.4837637890662103e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 02:04:40,453] [INFO] [timer.py:199:stop] epoch=15/micro_step=80/global_step=13820, RunningAvgSamplesPerSec=23.82678902434906, CurrSamplesPerSec=23.599078245402918, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 02:05:07,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=13830, skipped=265, lr=[1.4585293401167058e-07, 1.4585293401167058e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 02:05:07,627] [INFO] [timer.py:199:stop] epoch=15/micro_step=120/global_step=13830, RunningAvgSamplesPerSec=23.8266172636915, CurrSamplesPerSec=23.558842241111858, MemAllocated=11.44GB, MaxMemAllocated=31.66GB
[2023-04-24 02:05:34,573] [INFO] [logging.py:96:log_dist] [Rank 0] step=13840, skipped=265, lr=[1.4335080241465192e-07, 1.4335080241465192e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-24 02:05:34,818] [INFO] [timer.py:199:stop] epoch=15/micro_step=160/global_step=13840,
RunningAvgSamplesPerSec=23.826435254107402, CurrSamplesPerSec=23.57778433410472, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:06:01,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=13850, skipped=265, lr=[1.4086999551265822e-07, 1.4086999551265822e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:06:02,011] [INFO] [timer.py:199:stop] epoch=15/micro_step=200/global_step=13850, RunningAvgSamplesPerSec=23.82625189763736, CurrSamplesPerSec=23.60256006471382, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:06:28,972] [INFO] [logging.py:96:log_dist] [Rank 0] step=13860, skipped=265, lr=[1.3841052460565164e-07, 1.3841052460565164e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:06:29,218] [INFO] [timer.py:199:stop] epoch=15/micro_step=240/global_step=13860, RunningAvgSamplesPerSec=23.826059776926, CurrSamplesPerSec=23.589986371639792, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:06:45,236] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:06:47,658] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:06:55,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=13870, skipped=267, lr=[1.3645831732796759e-07, 1.3645831732796759e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:06:55,829] [INFO] [timer.py:199:stop] epoch=15/micro_step=280/global_step=13870, RunningAvgSamplesPerSec=23.82625032670732, CurrSamplesPerSec=23.56853917273589, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:07:22,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=13880, skipped=267, lr=[1.340372793775895e-07, 1.340372793775895e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:07:22,967] [INFO] [timer.py:199:stop] epoch=15/micro_step=320/global_step=13880, RunningAvgSamplesPerSec=23.82610232023376, CurrSamplesPerSec=23.60812314706428, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:07:49,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=13890, skipped=267, lr=[1.3163760854490226e-07, 1.3163760854490226e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:07:50,125] [INFO] [timer.py:199:stop] epoch=15/micro_step=360/global_step=13890, RunningAvgSamplesPerSec=23.825942417458773, CurrSamplesPerSec=23.590674653392952, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:08:17,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=13900, skipped=267, lr=[1.2925931576029583e-07, 1.2925931576029583e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:08:17,300] [INFO] [timer.py:199:stop] epoch=15/micro_step=400/global_step=13900, RunningAvgSamplesPerSec=23.82577146575407, CurrSamplesPerSec=23.626400123784578, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:08:44,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=13910, skipped=267, lr=[1.2690241185678313e-07, 1.2690241185678313e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:08:44,478] [INFO] [timer.py:199:stop] epoch=15/micro_step=440/global_step=13910, RunningAvgSamplesPerSec=23.825600217842464, CurrSamplesPerSec=23.602817403213532, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:09:11,405] [INFO] [logging.py:96:log_dist] [Rank 0] step=13920, skipped=267, lr=[1.2456690756995207e-07, 1.2456690756995207e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:09:11,651] [INFO] [timer.py:199:stop] epoch=15/micro_step=480/global_step=13920, RunningAvgSamplesPerSec=23.825431675904564, CurrSamplesPerSec=23.584879414947324, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:09:38,540] [INFO] [logging.py:96:log_dist] [Rank 0] step=13930, skipped=267, lr=[1.222528135379166e-07, 1.222528135379166e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:09:38,785] [INFO] [timer.py:199:stop] epoch=15/micro_step=520/global_step=13930, RunningAvgSamplesPerSec=23.825287597252903, CurrSamplesPerSec=23.601549442803563, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:10:05,705] [INFO] [logging.py:96:log_dist] [Rank 0] step=13940, skipped=267, lr=[1.1996014030126755e-07, 1.1996014030126755e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:10:05,950] [INFO] [timer.py:199:stop] epoch=15/micro_step=560/global_step=13940, RunningAvgSamplesPerSec=23.825124091892043, CurrSamplesPerSec=23.589099127916175, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:10:32,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=13950, skipped=267, lr=[1.176888983030254e-07, 1.176888983030254e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:10:33,136] [INFO] [timer.py:199:stop] epoch=15/micro_step=600/global_step=13950, RunningAvgSamplesPerSec=23.82495043921482, CurrSamplesPerSec=23.60296475278413, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:11:00,057] [INFO] [logging.py:96:log_dist] [Rank 0] step=13960, skipped=267, lr=[1.1543909788859272e-07, 1.1543909788859272e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:11:00,302] [INFO] [timer.py:199:stop] epoch=15/micro_step=640/global_step=13960, RunningAvgSamplesPerSec=23.824786349601613, CurrSamplesPerSec=23.5130141445735, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:11:21,761] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:11:24,188] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:11:26,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=13970, skipped=269, lr=[1.1365470238714959e-07, 1.1365470238714959e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:11:26,903] [INFO] [timer.py:199:stop] epoch=15/micro_step=680/global_step=13970, RunningAvgSamplesPerSec=23.824982257607374, CurrSamplesPerSec=23.61607584861702, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:11:53,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=13980, skipped=269, lr=[1.11443522582207e-07, 1.11443522582207e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:11:54,067] [INFO] [timer.py:199:stop] epoch=15/micro_step=720/global_step=13980, RunningAvgSamplesPerSec=23.82481963048821, CurrSamplesPerSec=23.602423096510257, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:12:21,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=13990, skipped=269, lr=[1.0925381280847098e-07, 1.0925381280847098e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:12:21,284] [INFO] [timer.py:199:stop] epoch=15/micro_step=760/global_step=13990, RunningAvgSamplesPerSec=23.82462521457072, CurrSamplesPerSec=23.641344412414046, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:12:48,206] [INFO] [logging.py:96:log_dist] [Rank 0] step=14000, skipped=269, lr=[1.0708558303996604e-07, 1.0708558303996604e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:12:48,452] [INFO] [timer.py:199:stop] epoch=15/micro_step=800/global_step=14000, RunningAvgSamplesPerSec=23.824460238809245, CurrSamplesPerSec=23.556220807066072, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
02:13:15,393] [INFO] [logging.py:96:log_dist] [Rank 0] step=14010, skipped=269, lr=[1.0493884315288011e-07, 1.0493884315288011e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:13:15,638] [INFO] [timer.py:199:stop] epoch=15/micro_step=840/global_step=14010, RunningAvgSamplesPerSec=23.82428418107013, CurrSamplesPerSec=23.52806275588538, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:13:42,626] [INFO] [logging.py:96:log_dist] [Rank 0] step=14020, skipped=269, lr=[1.0281360292551268e-07, 1.0281360292551268e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:13:42,872] [INFO] [timer.py:199:stop] epoch=15/micro_step=880/global_step=14020, RunningAvgSamplesPerSec=23.82408026961172, CurrSamplesPerSec=23.537088376756916, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:14:09,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=14030, skipped=269, lr=[1.0070987203823452e-07, 1.0070987203823452e-07], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:14:10,058] [INFO] [timer.py:199:stop] epoch=15/micro_step=920/global_step=14030, RunningAvgSamplesPerSec=23.823905198457634, CurrSamplesPerSec=23.539658076381713, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:14:37,038] [INFO] [logging.py:96:log_dist] [Rank 0] step=14040, skipped=269, lr=[9.862766007344329e-08, 9.862766007344329e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:14:37,284] [INFO] [timer.py:199:stop] epoch=15/micro_step=960/global_step=14040, RunningAvgSamplesPerSec=23.823706856083664, CurrSamplesPerSec=23.54376456819212, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:15:04,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=14050, skipped=269, lr=[9.65669765155169e-08, 9.65669765155169e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:15:04,509] [INFO] [timer.py:199:stop] epoch=15/micro_step=1000/global_step=14050, RunningAvgSamplesPerSec=23.82350729897777, CurrSamplesPerSec=23.62078269849572, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
02:15:31,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=14060, skipped=269, lr=[9.452783075077336e-08, 9.452783075077336e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:15:31,703] [INFO] [timer.py:199:stop] epoch=15/micro_step=1040/global_step=14060, RunningAvgSamplesPerSec=23.82332772308499, CurrSamplesPerSec=23.57142828290864, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:15:58,578] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:15:58,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=14070, skipped=270, lr=[9.271102205441317e-08, 9.271102205441317e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:15:58,579] [INFO] [timer.py:199:stop] epoch=15/micro_step=1080/global_step=14070, RunningAvgSamplesPerSec=23.82334708643913, CurrSamplesPerSec=26.475308476986143, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:16:01,004] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:16:25,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=14080, skipped=271, lr=[9.091167319838243e-08, 9.091167319838243e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:16:25,429] [INFO] [timer.py:199:stop] epoch=15/micro_step=1120/global_step=14080, RunningAvgSamplesPerSec=23.823383861314756, CurrSamplesPerSec=23.58019721825562, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:16:52,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=14090, skipped=271, lr=[8.89328823545444e-08, 8.89328823545444e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:16:52,607] [INFO] [timer.py:199:stop] epoch=15/micro_step=1160/global_step=14090, RunningAvgSamplesPerSec=23.82321283183497, CurrSamplesPerSec=23.621605810640286, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:17:19,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=14100, skipped=271, lr=[8.697566407683387e-08, 8.697566407683387e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:17:19,781] [INFO] [timer.py:199:stop] epoch=15/micro_step=1200/global_step=14100, RunningAvgSamplesPerSec=23.823045521881753, CurrSamplesPerSec=23.600163350639885, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:17:46,699] [INFO] [logging.py:96:log_dist] [Rank 0] step=14110, skipped=271, lr=[8.504002728029084e-08, 8.504002728029084e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:17:46,943] [INFO] [timer.py:199:stop] epoch=15/micro_step=1240/global_step=14110, RunningAvgSamplesPerSec=23.82288768278263, CurrSamplesPerSec=23.543681970026643, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:18:13,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=14120, skipped=271, lr=[8.312598078165002e-08, 8.312598078165002e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:18:14,129] [INFO] [timer.py:199:stop] epoch=15/micro_step=1280/global_step=14120, RunningAvgSamplesPerSec=23.822716938153434, CurrSamplesPerSec=23.606557746114227, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:18:41,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=14130, skipped=271, lr=[8.123353329930495e-08, 8.123353329930495e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:18:41,308] [INFO] [timer.py:199:stop] epoch=15/micro_step=1320/global_step=14130, RunningAvgSamplesPerSec=23.822547829205927, CurrSamplesPerSec=23.565973508286113, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:19:08,232] [INFO] [logging.py:96:log_dist] [Rank 0] step=14140, skipped=271, lr=[7.936269345326577e-08, 7.936269345326577e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:19:08,477] [INFO] [timer.py:199:stop] epoch=15/micro_step=1360/global_step=14140, RunningAvgSamplesPerSec=23.822384483156657, CurrSamplesPerSec=23.555701964605475, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:19:35,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=14150, skipped=271, lr=[7.751346976512104e-08, 7.751346976512104e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:19:35,673] [INFO] [timer.py:199:stop] epoch=15/micro_step=1400/global_step=14150, RunningAvgSamplesPerSec=23.822205901113858, CurrSamplesPerSec=23.52479461027351, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:20:02,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=14160, skipped=271, lr=[7.568587065800038e-08, 7.568587065800038e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:20:02,857] [INFO] [timer.py:199:stop] epoch=15/micro_step=1440/global_step=14160, RunningAvgSamplesPerSec=23.822036070875853, CurrSamplesPerSec=23.611559867936982, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:20:29,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=14170, skipped=271, lr=[7.387990445653098e-08, 7.387990445653098e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:20:30,036] [INFO] [timer.py:199:stop] epoch=15/micro_step=1480/global_step=14170, RunningAvgSamplesPerSec=23.82186734836958, CurrSamplesPerSec=23.59919235311333, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:20:35,198] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:20:37,618] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:20:56,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=14180, skipped=273, lr=[7.245071271867132e-08, 7.245071271867132e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:20:56,665] [INFO] [timer.py:199:stop] epoch=15/micro_step=1520/global_step=14180, RunningAvgSamplesPerSec=23.82204438996815, CurrSamplesPerSec=23.588245115224428, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:21:23,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=14190, skipped=273, lr=[7.068370641088817e-08, 7.068370641088817e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:21:23,895] [INFO] [timer.py:199:stop] epoch=15/micro_step=1560/global_step=14190, RunningAvgSamplesPerSec=23.82184403939642, CurrSamplesPerSec=23.571715990963146, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:21:50,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=14200, skipped=273, lr=[6.893835579338344e-08, 6.893835579338344e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:21:51,065] [INFO] [timer.py:199:stop] epoch=15/micro_step=1600/global_step=14200, RunningAvgSamplesPerSec=23.82168199731476, CurrSamplesPerSec=23.606954267039768, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:22:18,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=14210, skipped=273, lr=[6.721466881614827e-08, 6.721466881614827e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:22:18,247] [INFO] [timer.py:199:stop] epoch=15/micro_step=1640/global_step=14210, RunningAvgSamplesPerSec=23.8215124774558, CurrSamplesPerSec=23.671857561506954, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
02:22:45,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=14220, skipped=273, lr=[6.551265333049732e-08, 6.551265333049732e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:22:45,420] [INFO] [timer.py:199:stop] epoch=15/micro_step=1680/global_step=14220, RunningAvgSamplesPerSec=23.821349602432065, CurrSamplesPerSec=23.627109248536243, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:23:12,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=14230, skipped=273, lr=[6.38323170890318e-08, 6.38323170890318e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:23:12,579] [INFO] [timer.py:199:stop] epoch=15/micro_step=1720/global_step=14230, RunningAvgSamplesPerSec=23.821196499120887, CurrSamplesPerSec=23.629261838025727, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:23:39,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=14240, skipped=273, lr=[6.21736677456052e-08, 6.21736677456052e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:23:39,732] [INFO] [timer.py:199:stop] epoch=15/micro_step=1760/global_step=14240, RunningAvgSamplesPerSec=23.82104647975292, CurrSamplesPerSec=23.61183401806524, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:24:06,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=14250, skipped=273, lr=[6.053671285528843e-08, 6.053671285528843e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:24:06,880] [INFO] [timer.py:199:stop] epoch=15/micro_step=1800/global_step=14250, RunningAvgSamplesPerSec=23.820900841824077, CurrSamplesPerSec=23.678221947482925, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:24:33,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=14260, skipped=273, lr=[5.892145987433506e-08, 5.892145987433506e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:24:34,017] [INFO] [timer.py:199:stop] epoch=15/micro_step=1840/global_step=14260, RunningAvgSamplesPerSec=23.820759654089834, CurrSamplesPerSec=23.623859269679137, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 
02:25:00,926] [INFO] [logging.py:96:log_dist] [Rank 0] step=14270, skipped=273, lr=[5.732791616014806e-08, 5.732791616014806e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:25:01,170] [INFO] [timer.py:199:stop] epoch=15/micro_step=1880/global_step=14270, RunningAvgSamplesPerSec=23.820608464443684, CurrSamplesPerSec=23.694743965942685, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:25:11,721] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:25:14,147] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:25:27,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=14280, skipped=275, lr=[5.606871674191729e-08, 5.606871674191729e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:25:27,728] [INFO] [timer.py:199:stop] epoch=15/micro_step=1920/global_step=14280, RunningAvgSamplesPerSec=23.820828522631317, CurrSamplesPerSec=23.595396270749713, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:25:54,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=275, lr=[5.451426793290241e-08, 5.451426793290241e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:25:54,896] [INFO] [timer.py:199:stop] epoch=15/micro_step=1960/global_step=14290, RunningAvgSamplesPerSec=23.820669610730192, CurrSamplesPerSec=23.57800799701853, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:26:21,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=14300, skipped=275, lr=[5.298154846520809e-08, 5.298154846520809e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:26:22,086] [INFO] [timer.py:199:stop] epoch=15/micro_step=2000/global_step=14300, RunningAvgSamplesPerSec=23.820500241030597, CurrSamplesPerSec=23.547678320496836, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:26:49,011] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=14310, skipped=275, lr=[5.1470565320301137e-08, 5.1470565320301137e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:26:49,256] [INFO] [timer.py:199:stop] epoch=15/micro_step=2040/global_step=14310, RunningAvgSamplesPerSec=23.820341311433356, CurrSamplesPerSec=23.599686141015873, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:27:16,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=14320, skipped=275, lr=[4.998132538063975e-08, 4.998132538063975e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:27:16,401] [INFO] [timer.py:199:stop] epoch=15/micro_step=2080/global_step=14320, RunningAvgSamplesPerSec=23.820198754812616, CurrSamplesPerSec=23.659018104161728, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:27:43,327] [INFO] [logging.py:96:log_dist] [Rank 0] step=14330, skipped=275, lr=[4.851383542964191e-08, 4.851383542964191e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:27:43,571] [INFO] [timer.py:199:stop] epoch=15/micro_step=2120/global_step=14330, RunningAvgSamplesPerSec=23.82004239171939, CurrSamplesPerSec=23.58262302700618, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:28:10,472] [INFO] [logging.py:96:log_dist] [Rank 0] step=14340, skipped=275, lr=[4.706810215165701e-08, 4.706810215165701e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:28:10,716] [INFO] [timer.py:199:stop] epoch=15/micro_step=2160/global_step=14340, RunningAvgSamplesPerSec=23.81989927863853, CurrSamplesPerSec=23.640551153676487, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:28:37,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=14350, skipped=275, lr=[4.5644132131933135e-08, 4.5644132131933135e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:28:37,874] [INFO] [timer.py:199:stop] epoch=15/micro_step=2200/global_step=14350, RunningAvgSamplesPerSec=23.819748290581472, CurrSamplesPerSec=23.628007672982978, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:29:04,788] [INFO] [logging.py:96:log_dist] [Rank 0] 
step=14360, skipped=275, lr=[4.4241931856588175e-08, 4.4241931856588175e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:29:05,034] [INFO] [timer.py:199:stop] epoch=15/micro_step=2240/global_step=14360, RunningAvgSamplesPerSec=23.81959602911827, CurrSamplesPerSec=23.582069873826235, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:29:31,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=14370, skipped=275, lr=[4.28615077125782e-08, 4.28615077125782e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:29:32,189] [INFO] [timer.py:199:stop] epoch=15/micro_step=2280/global_step=14370, RunningAvgSamplesPerSec=23.8194469321726, CurrSamplesPerSec=23.55747769948262, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:29:48,206] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:29:50,634] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:29:58,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=14380, skipped=277, lr=[4.1772851440644845e-08, 4.1772851440644845e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:29:58,796] [INFO] [timer.py:199:stop] epoch=15/micro_step=2320/global_step=14380, RunningAvgSamplesPerSec=23.819635418197144, CurrSamplesPerSec=23.552523261625197, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:30:25,750] [INFO] [logging.py:96:log_dist] [Rank 0] step=14390, skipped=277, lr=[4.043164011154094e-08, 4.043164011154094e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:30:25,994] [INFO] [timer.py:199:stop] epoch=15/micro_step=2360/global_step=14390, RunningAvgSamplesPerSec=23.819460783747534, CurrSamplesPerSec=23.54100197406096, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:30:52,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=14400, skipped=277, lr=[3.911222226947448e-08, 3.911222226947448e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:30:53,186] [INFO] [timer.py:199:stop] epoch=15/micro_step=2400/global_step=14400, RunningAvgSamplesPerSec=23.819289697207747, CurrSamplesPerSec=23.62475536778331, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:31:20,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=14410, skipped=277, lr=[3.781460392433294e-08, 3.781460392433294e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:31:20,377] [INFO] [timer.py:199:stop] epoch=15/micro_step=2440/global_step=14410, RunningAvgSamplesPerSec=23.819119839316045, CurrSamplesPerSec=23.580824856177287, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:31:47,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=14420, skipped=277, lr=[3.653879098670754e-08, 3.653879098670754e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:31:47,519] [INFO] [timer.py:199:stop] epoch=15/micro_step=2480/global_step=14420, RunningAvgSamplesPerSec=23.818979628991073, CurrSamplesPerSec=23.578848841859614, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:32:14,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=14430, skipped=277, lr=[3.528478926786696e-08, 3.528478926786696e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:32:14,668] [INFO] [timer.py:199:stop] epoch=15/micro_step=2520/global_step=14430, RunningAvgSamplesPerSec=23.818833793771727, CurrSamplesPerSec=23.62628159388741, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:32:41,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=14440, skipped=277, lr=[3.405260447973165e-08, 3.405260447973165e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:32:41,796] [INFO] [timer.py:199:stop] epoch=15/micro_step=2560/global_step=14440, RunningAvgSamplesPerSec=23.818701746056014, CurrSamplesPerSec=23.678543598490347, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:33:08,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=14450, skipped=277, lr=[3.284224223484543e-08, 3.284224223484543e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:33:08,946] [INFO] [timer.py:199:stop] epoch=15/micro_step=2600/global_step=14450, RunningAvgSamplesPerSec=23.818556091651892, CurrSamplesPerSec=23.65876162389424, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:33:35,858] [INFO] [logging.py:96:log_dist] [Rank 0] step=14460, skipped=277, lr=[3.1653708046352495e-08, 3.1653708046352495e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:33:36,101] [INFO] [timer.py:199:stop] epoch=15/micro_step=2640/global_step=14460, RunningAvgSamplesPerSec=23.818406819313314, CurrSamplesPerSec=23.6503800741382, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:34:03,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=14470, skipped=277, lr=[3.048700732797002e-08, 3.048700732797002e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:34:03,261] [INFO] [timer.py:199:stop] epoch=15/micro_step=2680/global_step=14470, RunningAvgSamplesPerSec=23.81825540209142, CurrSamplesPerSec=23.576844167895196, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:34:24,696] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:34:27,114] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:34:29,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=14480, skipped=279, lr=[2.956937042627529e-08, 2.956937042627529e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:34:29,832] [INFO] [timer.py:199:stop] epoch=15/micro_step=2720/global_step=14480, RunningAvgSamplesPerSec=23.818464877703775, CurrSamplesPerSec=23.588050276332133, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:34:56,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=14490, skipped=279, lr=[2.8441983279202135e-08, 2.8441983279202135e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:34:56,978] [INFO] [timer.py:199:stop] epoch=15/micro_step=2760/global_step=14490, RunningAvgSamplesPerSec=23.818322563102644, CurrSamplesPerSec=23.65211176032262, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:35:23,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=14500, skipped=279, lr=[2.7336444231497413e-08, 2.7336444231497413e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:35:24,186] [INFO] [timer.py:199:stop] epoch=15/micro_step=2800/global_step=14500, RunningAvgSamplesPerSec=23.818143575135625, CurrSamplesPerSec=23.53989959506899, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:35:51,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=14510, skipped=279, lr=[2.6252758318841213e-08, 2.6252758318841213e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:35:51,358] [INFO] [timer.py:199:stop] epoch=15/micro_step=2840/global_step=14510, RunningAvgSamplesPerSec=23.81798677220892, CurrSamplesPerSec=23.55899317739527, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 02:36:18,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=14520, skipped=279, lr=[2.5190930477372e-08, 2.5190930477372e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:36:18,548] [INFO] [timer.py:199:stop] epoch=15/micro_step=2880/global_step=14520, RunningAvgSamplesPerSec=23.817819735482573, CurrSamplesPerSec=23.54028149370543, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:36:45,492] [INFO] [logging.py:96:log_dist] [Rank 0] step=14530, skipped=279, lr=[2.4150965543665742e-08, 2.4150965543665742e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:36:45,736] [INFO] [timer.py:199:stop] epoch=15/micro_step=2920/global_step=14530, RunningAvgSamplesPerSec=23.817651754832806, CurrSamplesPerSec=23.575821252833183, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:37:12,660] [INFO] [logging.py:96:log_dist] [Rank 0] step=14540, skipped=279, lr=[2.3132868254715004e-08, 2.3132868254715004e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:37:12,904] [INFO] [timer.py:199:stop] epoch=15/micro_step=2960/global_step=14540, RunningAvgSamplesPerSec=23.81749712660169, CurrSamplesPerSec=23.617393161908137, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:37:39,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=14550, skipped=279, lr=[2.213664324790646e-08, 2.213664324790646e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:37:40,084] [INFO] [timer.py:199:stop] epoch=15/micro_step=3000/global_step=14550, RunningAvgSamplesPerSec=23.817335431984343, CurrSamplesPerSec=23.56871920403741, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:38:06,988] [INFO] [logging.py:96:log_dist] [Rank 0] step=14560, skipped=279, lr=[2.1162295060997317e-08, 2.1162295060997317e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:38:07,233] [INFO] [timer.py:199:stop] epoch=15/micro_step=3040/global_step=14560, RunningAvgSamplesPerSec=23.817191557531455, CurrSamplesPerSec=23.598217286435112, MemAllocated=11.44GB, MaxMemAllocated=31.66GB 
[2023-04-24 02:38:34,177] [INFO] [logging.py:96:log_dist] [Rank 0] step=14570, skipped=279, lr=[2.020982813209978e-08, 2.020982813209978e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:38:34,422] [INFO] [timer.py:199:stop] epoch=15/micro_step=3080/global_step=14570, RunningAvgSamplesPerSec=23.817025754914635, CurrSamplesPerSec=23.496396921818363, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:39:01,289] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:39:01,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=14580, skipped=280, lr=[1.9371319959474087e-08, 1.9371319959474087e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:39:01,291] [INFO] [timer.py:199:stop] epoch=15/micro_step=3120/global_step=14580, RunningAvgSamplesPerSec=23.817055132019107, CurrSamplesPerSec=26.43466477906336, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:39:03,715] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:39:27,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=14590, skipped=281, lr=[1.8550542214940644e-08, 1.8550542214940644e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:39:28,156] [INFO] [timer.py:199:stop] epoch=15/micro_step=3160/global_step=14590, RunningAvgSamplesPerSec=23.817086195768855, CurrSamplesPerSec=23.62178041756102, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:39:55,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=14600, skipped=281, lr=[1.765936557077271e-08, 1.765936557077271e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:39:55,330] [INFO] [timer.py:199:stop] epoch=15/micro_step=3200/global_step=14600, RunningAvgSamplesPerSec=23.816929297494788, CurrSamplesPerSec=23.65164911897349, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:40:22,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=14610, skipped=281, lr=[1.6790086140297347e-08, 1.6790086140297347e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:40:22,474] [INFO] [timer.py:199:stop] epoch=15/micro_step=3240/global_step=14610, RunningAvgSamplesPerSec=23.816790019620264, CurrSamplesPerSec=23.569816005865334, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:40:49,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=14620, skipped=281, lr=[1.594270788304202e-08, 1.594270788304202e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:40:49,626] [INFO] [timer.py:199:stop] epoch=15/micro_step=3280/global_step=14620, RunningAvgSamplesPerSec=23.816646524996983, CurrSamplesPerSec=23.63449208049739, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:41:16,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=14630, skipped=281, lr=[1.51172346587751e-08, 1.51172346587751e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:41:16,776] [INFO] [timer.py:199:stop] epoch=15/micro_step=3320/global_step=14630, RunningAvgSamplesPerSec=23.81650450815816, CurrSamplesPerSec=23.66402996859817, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:41:43,679] [INFO] [logging.py:96:log_dist] [Rank 0] step=14640, skipped=281, lr=[1.4313670227489249e-08, 1.4313670227489249e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:41:43,925] [INFO] [timer.py:199:stop] epoch=15/micro_step=3360/global_step=14640, RunningAvgSamplesPerSec=23.816363472045964, CurrSamplesPerSec=23.585988083268628, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:42:10,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=14650, skipped=281, lr=[1.3532018249383205e-08, 1.3532018249383205e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:42:11,079] [INFO] [timer.py:199:stop] epoch=15/micro_step=3400/global_step=14650, RunningAvgSamplesPerSec=23.816219384455533, CurrSamplesPerSec=23.58807100372087, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:42:38,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=14660, skipped=281, lr=[1.2772282284845187e-08, 1.2772282284845187e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:42:38,255] [INFO] [timer.py:199:stop] epoch=15/micro_step=3440/global_step=14660, RunningAvgSamplesPerSec=23.81606210823403, CurrSamplesPerSec=23.671229243122472, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:43:05,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=14670, skipped=281, lr=[1.2034465794436808e-08, 1.2034465794436808e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:43:05,424] [INFO] [timer.py:199:stop] epoch=15/micro_step=3480/global_step=14670, RunningAvgSamplesPerSec=23.815908448337467, CurrSamplesPerSec=23.660469511905973, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:43:32,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=14680, skipped=281, lr=[1.1318572138879166e-08, 1.1318572138879166e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:43:32,593] [INFO] [timer.py:199:stop] epoch=15/micro_step=3520/global_step=14680, RunningAvgSamplesPerSec=23.815755177432592, 
CurrSamplesPerSec=23.631354489346695, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:43:37,730] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, but hysteresis is 2. Reducing hysteresis to 1 [2023-04-24 02:43:40,150] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072 [2023-04-24 02:43:58,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=14690, skipped=283, lr=[1.076164385017997e-08, 1.076164385017997e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:43:59,182] [INFO] [timer.py:199:stop] epoch=15/micro_step=3560/global_step=14690, RunningAvgSamplesPerSec=23.81595249048192, CurrSamplesPerSec=23.525557440000224, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:44:26,082] [INFO] [logging.py:96:log_dist] [Rank 0] step=14700, skipped=283, lr=[1.0085219447612204e-08, 1.0085219447612204e-08], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:44:26,327] [INFO] [timer.py:199:stop] epoch=15/micro_step=3600/global_step=14700, RunningAvgSamplesPerSec=23.815814382523826, CurrSamplesPerSec=23.644830384843658, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:44:53,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=14710, skipped=283, lr=[9.430726758620686e-09, 9.430726758620686e-09], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:44:53,460] [INFO] [timer.py:199:stop] epoch=15/micro_step=3640/global_step=14710, RunningAvgSamplesPerSec=23.815683468770917, CurrSamplesPerSec=23.64389111189033, MemAllocated=11.44GB, MaxMemAllocated=31.66GB [2023-04-24 02:45:20,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=14720, skipped=283, lr=[8.798168764390025e-09, 8.798168764390025e-09], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-24 02:45:20,627] [INFO] [timer.py:199:stop] epoch=15/micro_step=3680/global_step=14720, RunningAvgSamplesPerSec=23.815532323905884, CurrSamplesPerSec=23.595249015661345, 
MemAllocated=11.44GB, MaxMemAllocated=31.66GB
***** Evaluating perplexity, Epoch 16/16 *****
ppl: 1.7810072898864746
saving the final model ...
[2023-04-24 02:46:01,159] [INFO] [launch.py:460:main] Process 7670 exits successfully.
[2023-04-24 02:46:18,180] [INFO] [launch.py:460:main] Process 7669 exits successfully.