W0828 13:12:16.067000 140133368919872 torch/distributed/run.py:779]
W0828 13:12:16.067000 140133368919872 torch/distributed/run.py:779] *****************************************
W0828 13:12:16.067000 140133368919872 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0828 13:12:16.067000 140133368919872 torch/distributed/run.py:779] *****************************************
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
[2024-08-28 13:12:18,163] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-28 13:12:18,250] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
Traceback (most recent call last):
  File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py)
[the second worker prints an identical traceback]
E0828 13:12:19.984000 140133368919872 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 200677) of binary: /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/bin/python
Traceback (most recent call last):
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-28_13:12:19
  host      : SH-IDC1-10-198-35-71
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 200678)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_13:12:19
  host      : SH-IDC1-10-198-35-71
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 200677)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-08-28 13:12:55,997] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[the relaunch prints the same FutureWarnings and fails with the same ImportError traceback]
E0828 13:12:57.974000 140476093118272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 201112) of binary: /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/bin/python
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_13:12:57
  host      : SH-IDC1-10-198-35-71
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 201112)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
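Both aborted launches above die inside `import deepspeed`: `deepspeed/elasticity/elastic_agent.py` imports `log` and `_get_socket_with_port` from `torch.distributed.elastic.agent.server.api`, and neither symbol exists in the torch build installed here (torch 2.4, per the sparse_attn warning in the next launch). A minimal check of that mismatch, as a hedged sketch that assumes nothing beyond what the traceback prints:

import importlib
import torch

# Probe for the two symbols DeepSpeed's elastic_agent tries to import.
# If both are missing, `import deepspeed` fails exactly as in the traceback above.
api = importlib.import_module("torch.distributed.elastic.agent.server.api")
print("torch:", torch.__version__)
print("has log:", hasattr(api, "log"))
print("has _get_socket_with_port:", hasattr(api, "_get_socket_with_port"))

Moving to a DeepSpeed build whose elastic_agent no longer imports these names (or pinning an older torch) is the usual way out; the exact compatible version pair is not something this log shows.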
W0828 13:22:23.854000 140644781451072 torch/distributed/run.py:779]
W0828 13:22:23.854000 140644781451072 torch/distributed/run.py:779] *****************************************
W0828 13:22:23.854000 140644781451072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0828 13:22:23.854000 140644781451072 torch/distributed/run.py:779] *****************************************
[2024-08-28 13:22:26,498] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-28 13:22:26,515] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated.
Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2024-08-28 13:22:29,053] [INFO] [comm.py:637:init_distributed] cdb=None [2024-08-28 13:22:29,053] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 08/28/2024 13:22:29 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/28/2024 13:22:29 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=2, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/runs/Aug28_13-22-29_SH-IDC1-10-198-35-71, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=5.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, 
torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 08/28/2024 13:22:29 - INFO - __main__ - Loading Tokenizer: ./pretrained/InternVL2-8B [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:22:29,128 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:22:29,128 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:22:29,128 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:22:29,128 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:22:29,129 >> loading file tokenizer.json [WARNING|logging.py:314] 2024-08-28 13:22:29,272 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 08/28/2024 13:22:29 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2024-08-28 13:22:29,384 >> loading configuration file ./pretrained/InternVL2-8B/config.json [INFO|configuration_utils.py:792] 2024-08-28 13:22:29,386 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": 
false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internlm2-chat", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 08/28/2024 13:22:29 - INFO - __main__ - Using flash_attention_2 for InternLM [INFO|modeling_utils.py:3473] 2024-08-28 13:22:29,388 >> loading weights file ./pretrained/InternVL2-8B/model.safetensors.index.json [INFO|modeling_utils.py:1426] 2024-08-28 13:22:29,388 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:826] 2024-08-28 13:22:29,389 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2024-08-28 13:22:29,439 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } [2024-08-28 13:22:30,060] [INFO] [comm.py:637:init_distributed] cdb=None 08/28/2024 13:22:30 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False [WARNING|logging.py:314] 2024-08-28 13:22:30,277 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
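As an aside on the TrainingArguments dumped above: with per_device_train_batch_size=4, gradient_accumulation_steps=2 and the two ranks visible in this log (cuda:0 and cuda:1), the implied global batch per optimizer step works out as below; treating the world size as 2 is an assumption read off the rank lines, it is not printed explicitly.

# Hedged sketch: effective batch size implied by the TrainingArguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
world_size = 2  # assumption: ranks 0 and 1 are the only workers
print(per_device_train_batch_size * gradient_accumulation_steps * world_size)  # 16 samples per optimizer step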
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
>> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-08-28 13:22:36,096 >> All the weights of InternVLChatModel were initialized from the model checkpoint at ./pretrained/InternVL2-8B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-08-28 13:22:36,100 >> loading configuration file ./pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-08-28 13:22:36,100 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}
08/28/2024 13:22:36 - INFO - __main__ - Finished
08/28/2024 13:22:36 - INFO - __main__ - model.config.force_image_size: 448
08/28/2024 13:22:36 - INFO - __main__ - data_args.force_image_size: 448
08/28/2024 13:22:36 - INFO - __main__ - model.config.vision_config.image_size: 448
08/28/2024 13:22:36 - INFO - __main__ - [Dataset] num_image_token: 256
08/28/2024 13:22:36 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/28/2024 13:22:36 - INFO - __main__ - [Dataset] use_thumbnail: True
08/28/2024 13:22:36 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
08/28/2024 13:22:36 - INFO - __main__ - Formatting inputs...Skip in lazy mode
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 910, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 822, in main
[rank0]:     train_dataset = build_datasets(
[rank0]:   File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 574, in build_datasets
[rank0]:     dataset = LazySupervisedDataset(
[rank0]:   File "/mnt/nvme0n1/workspace/fengdahu/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 255, in __init__
[rank0]:     with open(meta['annotation'], 'r') as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'data/ui-dataset/annotations/ui_dataset_train.jsonl'
W0828 13:22:37.089000 140644781451072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 208849 closing signal SIGTERM
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0828 13:22:37.403000 140644781451072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 208848) of binary: /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/bin/python
Traceback (most recent call last):
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_13:22:37
  host      : SH-IDC1-10-198-35-71
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 208848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
def backward(ctx, grad_output): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2024-08-28 13:26:01,972] [INFO] [comm.py:637:init_distributed] cdb=None [2024-08-28 13:26:01,972] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 08/28/2024 13:26:02 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/28/2024 13:26:02 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=2, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/runs/Aug28_13-26-01_SH-IDC1-10-198-35-71, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=5.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, 
per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 08/28/2024 13:26:02 - INFO - __main__ - Loading Tokenizer: ./pretrained/InternVL2-8B [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:26:02,061 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:26:02,061 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:26:02,061 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:26:02,061 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:26:02,062 >> loading file tokenizer.json [WARNING|logging.py:314] 2024-08-28 13:26:02,203 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 08/28/2024 13:26:02 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2024-08-28 13:26:02,313 >> loading configuration file ./pretrained/InternVL2-8B/config.json [INFO|configuration_utils.py:792] 2024-08-28 13:26:02,314 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, 
"num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internlm2-chat", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 08/28/2024 13:26:02 - INFO - __main__ - Using flash_attention_2 for InternLM [INFO|modeling_utils.py:3473] 2024-08-28 13:26:02,315 >> loading weights file ./pretrained/InternVL2-8B/model.safetensors.index.json [INFO|modeling_utils.py:1426] 2024-08-28 13:26:02,316 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. 
[INFO|configuration_utils.py:826] 2024-08-28 13:26:02,317 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2024-08-28 13:26:02,366 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}
[2024-08-28 13:26:03,078] [INFO] [comm.py:637:init_distributed] cdb=None
08/28/2024 13:26:03 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2024-08-28 13:26:03,293 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
>> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2024-08-28 13:26:10,231 >> All the weights of InternVLChatModel were initialized from the model checkpoint at ./pretrained/InternVL2-8B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-08-28 13:26:10,235 >> loading configuration file ./pretrained/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-08-28 13:26:10,235 >> Generate config GenerationConfig {
  "eos_token_id": [
    92542,
    92543
  ]
}
08/28/2024 13:26:10 - INFO - __main__ - Finished
08/28/2024 13:26:10 - INFO - __main__ - model.config.force_image_size: 448
08/28/2024 13:26:10 - INFO - __main__ - data_args.force_image_size: 448
08/28/2024 13:26:10 - INFO - __main__ - model.config.vision_config.image_size: 448
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] num_image_token: 256
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] use_thumbnail: True
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
08/28/2024 13:26:10 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/28/2024 13:26:10 - INFO - __main__ - Add dataset: ui-dataset with length: 416
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] num_image_token: 256
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] use_thumbnail: True
08/28/2024 13:26:10 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
08/28/2024 13:26:10 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:05<00:01,  1.97s/it]
08/28/2024 13:26:10 - INFO - __main__ - Add eval dataset: ui-dataset with length: 46
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight
08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight
[the listing continues with the matching lora_B weight and the feed_forward.w2 pair for layer 0, then the same ten LoRA tensors (attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w3, feed_forward.w2, each with lora_A and lora_B) for every subsequent layer; the paste breaks off partway through layer 14]
language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - 
language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 08/28/2024 13:26:11 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight 08/28/2024 13:26:11 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.45s/it] Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.65s/it] [INFO|trainer.py:571] 2024-08-28 13:26:11,363 >> Using auto half precision backend [2024-08-28 13:26:11,523] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855 [2024-08-28 13:26:14,793] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121 as PyTorch extensions root... Creating extension directory /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121/fused_adam... Using /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/3] /mnt/llm/toolchains/cuda/cuda-12.1/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -ccbin gcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/llm/toolchains/cuda/cuda-12.1/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o [2/3] g++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/TH -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/include/THC -isystem /mnt/llm/toolchains/cuda/cuda-12.1/include -isystem /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o [3/3] g++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/mnt/llm/toolchains/cuda/cuda-12.1/lib64 -lcudart -o 
fused_adam.so Loading extension module fused_adam... Time to load fused_adam op: 27.1447594165802 seconds [2024-08-28 13:26:41,940] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2024-08-28 13:26:41,940] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module fused_adam... Time to load fused_adam op: 27.148159980773926 seconds [2024-08-28 13:26:41,971] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2024-08-28 13:26:41,971] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2024-08-28 13:26:41,971] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2024-08-28 13:26:41,971] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000 [2024-08-28 13:26:41,971] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000 [2024-08-28 13:26:41,971] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False [2024-08-28 13:26:41,972] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False [2024-08-28 13:26:42,184] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-08-28 13:26:42,184] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.72 GB CA 15.91 GB Max_CA 16 GB [2024-08-28 13:26:42,185] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.0 GB, percent = 6.9% [2024-08-28 13:26:42,343] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-08-28 13:26:42,344] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.76 GB CA 15.98 GB Max_CA 16 GB [2024-08-28 13:26:42,345] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.0 GB, percent = 6.9% [2024-08-28 13:26:42,345] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized [2024-08-28 13:26:42,500] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-08-28 13:26:42,500] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.68 GB CA 15.98 GB Max_CA 16 GB [2024-08-28 13:26:42,501] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.0 GB, percent = 6.9% [2024-08-28 13:26:42,504] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-08-28 13:26:42,504] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2024-08-28 13:26:42,504] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2024-08-28 13:26:42,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2024-08-28 13:26:42,508] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-08-28 13:26:42,508] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-08-28 13:26:42,508] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-08-28 13:26:42,508] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-08-28 13:26:42,508] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-08-28 13:26:42,508] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] bfloat16_enabled ............. True [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] comms_config ................. [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] dump_state ................... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... 
None [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] fp16_auto_cast ............... None [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] fp16_enabled ................. False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 2 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] loss_scale ................... 1.0 [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-08-28 13:26:42,509] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... 
False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] optimizer_name ............... adamw [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] pld_params ................... False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] steps_per_print .............. inf [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] train_batch_size ............. 16 [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 4 [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] wall_clock_breakdown ......... True [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] world_size ................... 2 [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. 
True [2024-08-28 13:26:42,510] [INFO] [config.py:1001:print] zero_optimization_stage ...... 1 [2024-08-28 13:26:42,510] [INFO] [config.py:987:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2024-08-28 13:26:42,511 >> ***** Running training ***** [INFO|trainer.py:1722] 2024-08-28 13:26:42,511 >> Num examples = 416 [INFO|trainer.py:1723] 2024-08-28 13:26:42,511 >> Num Epochs = 5 [INFO|trainer.py:1724] 2024-08-28 13:26:42,511 >> Instantaneous batch size per device = 4 [INFO|trainer.py:1727] 2024-08-28 13:26:42,511 >> Total train batch size (w. parallel, distributed & accumulation) = 16 [INFO|trainer.py:1728] 2024-08-28 13:26:42,511 >> Gradient Accumulation steps = 2 [INFO|trainer.py:1729] 2024-08-28 13:26:42,511 >> Total optimization steps = 130 [INFO|trainer.py:1730] 2024-08-28 13:26:42,515 >> Number of trainable parameters = 37,748,736 0%| | 0/130 [00:00= 1.5 and < 2.0 but detected 2.4  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2024-08-28 13:28:32,192] [INFO] [comm.py:637:init_distributed] cdb=None [2024-08-28 13:28:32,193] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! 
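As a side note on the "Running training" banner above: the reported batch sizes and step counts are internally consistent. A minimal sanity-check sketch, using only numbers printed in this log (micro-batch 4 per GPU, gradient accumulation 2, world_size 2, 416 examples, 5 epochs):

```python
# Sanity check of the trainer banner; every input below is a value printed earlier in this log.
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu / "Instantaneous batch size per device"
grad_accum_steps = 2      # gradient_accumulation_steps
world_size = 2            # DeepSpeed world_size
num_examples = 416        # "Num examples = 416"
num_epochs = 5            # "Num Epochs = 5"

global_batch = micro_batch_per_gpu * grad_accum_steps * world_size
steps_per_epoch = num_examples // global_batch   # 416 divides evenly, so rounding details don't matter here
total_steps = steps_per_epoch * num_epochs

print(global_batch)   # 16  -> "Total train batch size (w. parallel, distributed & accumulation) = 16"
print(total_steps)    # 130 -> "Total optimization steps = 130"
```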
petrel_client is not installed. Using PIL to load images. 08/28/2024 13:28:32 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/28/2024 13:28:32 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=2, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/runs/Aug28_13-28-32_SH-IDC1-10-198-35-71, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=5.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 08/28/2024 13:28:32 - INFO - __main__ - Loading Tokenizer: ./pretrained/InternVL2-8B [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:28:32,297 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:28:32,297 >> loading file added_tokens.json 
[INFO|tokenization_utils_base.py:2025] 2024-08-28 13:28:32,297 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:28:32,297 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2024-08-28 13:28:32,297 >> loading file tokenizer.json [WARNING|logging.py:314] 2024-08-28 13:28:32,440 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 08/28/2024 13:28:32 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2024-08-28 13:28:32,551 >> loading configuration file ./pretrained/InternVL2-8B/config.json [INFO|configuration_utils.py:792] 2024-08-28 13:28:32,552 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internlm2-chat", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, 
"vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 08/28/2024 13:28:32 - INFO - __main__ - Using flash_attention_2 for InternLM [INFO|modeling_utils.py:3473] 2024-08-28 13:28:32,554 >> loading weights file ./pretrained/InternVL2-8B/model.safetensors.index.json [INFO|modeling_utils.py:1426] 2024-08-28 13:28:32,554 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:826] 2024-08-28 13:28:32,555 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2024-08-28 13:28:32,604 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } [2024-08-28 13:28:33,380] [INFO] [comm.py:637:init_distributed] cdb=None 08/28/2024 13:28:33 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False [WARNING|logging.py:314] 2024-08-28 13:28:33,599 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4358] 2024-08-28 13:28:40,365 >> All the weights of InternVLChatModel were initialized from the model checkpoint at ./pretrained/InternVL2-8B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. 
[INFO|configuration_utils.py:779] 2024-08-28 13:28:40,369 >> loading configuration file ./pretrained/InternVL2-8B/generation_config.json [INFO|configuration_utils.py:826] 2024-08-28 13:28:40,370 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] } 08/28/2024 13:28:40 - INFO - __main__ - Finished 08/28/2024 13:28:40 - INFO - __main__ - model.config.force_image_size: 448 08/28/2024 13:28:40 - INFO - __main__ - data_args.force_image_size: 448 08/28/2024 13:28:40 - INFO - __main__ - model.config.vision_config.image_size: 448 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] num_image_token: 256 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] use_thumbnail: True 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 08/28/2024 13:28:40 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/28/2024 13:28:40 - INFO - __main__ - Add dataset: ui-dataset with length: 416 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] num_image_token: 256 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] use_thumbnail: True 08/28/2024 13:28:40 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 08/28/2024 13:28:40 - INFO - __main__ - Formatting inputs...Skip in lazy mode Loading checkpoint shards: 75%|███████▌ | 3/4 [00:05<00:01, 1.91s/it]08/28/2024 13:28:41 - INFO - __main__ - Add eval dataset: ui-dataset with length: 46 Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.50s/it] Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.65s/it] trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight 08/28/2024 13:28:41 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight 08/28/2024 13:28:41 - INFO - __main__ - 
language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight
[... the same ten LoRA weights (attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w3 and feed_forward.w2, each with a lora_A and a lora_B matrix) are logged for every remaining transformer layer, layers.1 through layers.31, repeating the pattern shown for layers.0 above ...]
08/28/2024 13:28:41 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
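The trainable-parameter dump above is what a PEFT LoRA wrapper over the InternLM2 decoder looks like: only the low-rank adapters on attention.wqkv, attention.wo and the three feed-forward projections are trainable, and the reported 37,748,736 trainable parameters (0.4855% of 7,775,531,008) are exactly consistent with a LoRA rank of 16 on those five projections across all 32 layers. A rough sketch of an equivalent peft configuration follows; the exact call lives in the InternVL training code, and the alpha/dropout values here are assumptions, with the rank inferred from the parameter count:

```python
from peft import LoraConfig, get_peft_model

# r = 16 is inferred: 16 * (10240 + 8192 + 18432 + 18432 + 18432) params per layer
# * 32 layers = 37,748,736, matching the log. alpha and dropout are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,          # assumption, commonly 2 * r
    lora_dropout=0.05,      # assumption
    target_modules=["attention.wqkv", "attention.wo",
                    "feed_forward.w1", "feed_forward.w2", "feed_forward.w3"],
    task_type="CAUSAL_LM",
)

# Applied to the language model inside InternVLChatModel (sketch only):
# model.language_model = get_peft_model(model.language_model, lora_config)
# model.language_model.print_trainable_parameters()
# -> trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855
```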
[INFO|trainer.py:571] 2024-08-28 13:28:41,806 >> Using auto half precision backend [2024-08-28 13:28:42,034] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4855 [2024-08-28 13:28:45,459] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121 as PyTorch extensions root... Using /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /mnt/nvme0n1/workspace/fengdahu/fdh_cache/torch_extensions/py310_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.08145475387573242 seconds [2024-08-28 13:28:45,543] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2024-08-28 13:28:45,543] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module fused_adam... Time to load fused_adam op: 0.10166192054748535 seconds [2024-08-28 13:28:45,578] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2024-08-28 13:28:45,578] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2024-08-28 13:28:45,578] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2024-08-28 13:28:45,578] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000 [2024-08-28 13:28:45,578] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000 [2024-08-28 13:28:45,578] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False [2024-08-28 13:28:45,578] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False [2024-08-28 13:28:45,827] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-08-28 13:28:45,827] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.72 GB CA 15.91 GB Max_CA 16 GB [2024-08-28 13:28:45,828] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.91 GB, percent = 7.0% [2024-08-28 13:28:45,987] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-08-28 13:28:45,988] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.76 GB CA 15.98 GB Max_CA 16 GB [2024-08-28 13:28:45,988] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.91 GB, percent = 7.0% [2024-08-28 13:28:45,988] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized [2024-08-28 13:28:46,147] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-08-28 13:28:46,148] [INFO] [utils.py:782:see_memory_usage] MA 15.68 GB Max_MA 15.68 GB CA 15.98 GB Max_CA 16 GB [2024-08-28 13:28:46,148] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 52.92 GB, percent = 7.0% [2024-08-28 13:28:46,151] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-08-28 13:28:46,152] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2024-08-28 13:28:46,152] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed 
LR Scheduler = [2024-08-28 13:28:46,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2024-08-28 13:28:46,156] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-08-28 13:28:46,156] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-08-28 13:28:46,156] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-08-28 13:28:46,156] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-08-28 13:28:46,156] [INFO] [config.py:1001:print] amp_params ................... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] bfloat16_enabled ............. True [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] comms_config ................. [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] curriculum_params_legacy ..... 
False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] dump_state ................... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] fp16_auto_cast ............... None [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] fp16_enabled ................. False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-08-28 13:28:46,157] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 2 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] loss_scale ................... 1.0 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] mics_shard_size .............. 
-1 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] optimizer_name ............... adamw [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] pld_params ................... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] steps_per_print .............. inf [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] train_batch_size ............. 16 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 4 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] wall_clock_breakdown ......... True [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] world_size ................... 2 [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-08-28 13:28:46,158] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-08-28 13:28:46,159] [INFO] [config.py:1001:print] zero_optimization_stage ...... 1 [2024-08-28 13:28:46,159] [INFO] [config.py:987:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2024-08-28 13:28:46,159 >> ***** Running training ***** [INFO|trainer.py:1722] 2024-08-28 13:28:46,159 >> Num examples = 416 [INFO|trainer.py:1723] 2024-08-28 13:28:46,159 >> Num Epochs = 5 [INFO|trainer.py:1724] 2024-08-28 13:28:46,159 >> Instantaneous batch size per device = 4 [INFO|trainer.py:1727] 2024-08-28 13:28:46,159 >> Total train batch size (w. parallel, distributed & accumulation) = 16 [INFO|trainer.py:1728] 2024-08-28 13:28:46,159 >> Gradient Accumulation steps = 2 [INFO|trainer.py:1729] 2024-08-28 13:28:46,159 >> Total optimization steps = 130 [INFO|trainer.py:1730] 2024-08-28 13:28:46,164 >> Number of trainable parameters = 37,748,736
0%| | 0/130 [00:00<?, ?it/s]  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.  [WARNING]  async_io: please install the libaio-devel package with yum  [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.  [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images.
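The configuration and trainer summary above fix the effective batch size and the step count. A quick sanity check of how those numbers fit together (variable names here are illustrative, not from the training script):

micro_batch_per_gpu = 4      # train_micro_batch_size_per_gpu
grad_accum_steps = 2         # gradient_accumulation_steps
world_size = 2               # number of GPUs / ranks
num_train_examples = 416
num_epochs = 5

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
steps_per_epoch = num_train_examples // train_batch_size   # 416 divides evenly here
total_optimization_steps = steps_per_epoch * num_epochs

print(train_batch_size)           # 16  -> matches "train_batch_size ............. 16"
print(steps_per_epoch)            # 26  -> so per-epoch evaluation lands on steps 26, 52, ...
print(total_optimization_steps)   # 130 -> matches "Total optimization steps = 130"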
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 929 [2024-08-28 13:30:18,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.57 | optimizer_step: 0.58 [2024-08-28 13:30:18,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.61 | bwd_microstep: 873.45 | bwd_inner_microstep: 867.61 | bwd_allreduce_microstep: 5.79 | step_microstep: 9.63 [2024-08-28 13:30:18,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 949.56 | bwd: 1734.47 | bwd_inner: 1728.42 | bwd_allreduce: 5.89 | step: 9.79 20%|██ | 26/130 [01:32<05:16, 3.05s/it] {'loss': 1.3587, 'learning_rate': 3.7065817632643115e-05, 'epoch': 1.0} 20%|██ | 26/130 [01:32<05:16, 3.05s/it][INFO|trainer.py:3242] 2024-08-28 13:30:18,717 >> ***** Running Evaluation ***** [INFO|trainer.py:3244] 2024-08-28 13:30:18,717 >> Num examples = 46 [INFO|trainer.py:3247] 2024-08-28 13:30:18,717 >> Batch size = 8 [2024-08-28 13:30:20,368] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output):
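The FutureWarning above comes from deepspeed/runtime/zero/linear.py, which still decorates its autograd Function with torch.cuda.amp.custom_fwd / custom_bwd. A minimal sketch of the change the warning asks for, assuming PyTorch 2.4 (the version detected in this run); the LinearFunction class below is a simplified stand-in for illustration, not DeepSpeed's actual implementation:

import torch

class LinearFunction(torch.autograd.Function):
    # Simplified stand-in for the autograd Function in deepspeed/runtime/zero/linear.py;
    # only the decorator change is the point here.

    @staticmethod
    @torch.amp.custom_fwd(device_type='cuda')   # was: @torch.cuda.amp.custom_fwd
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        return input.matmul(weight.t())          # (batch, out_features)

    @staticmethod
    @torch.amp.custom_bwd(device_type='cuda')   # was: @torch.cuda.amp.custom_bwd
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_output.matmul(weight)       # (batch, in_features)
        grad_weight = grad_output.t().matmul(input)   # (out_features, in_features)
        return grad_input, grad_weight

# usage: out = LinearFunction.apply(x, w)  with x of shape (batch, in) and w of shape (out, in)
# The warning itself is informational; nothing in this run needs to change until the old decorators are removed.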
dynamic ViT batch size: 18, images per sample: 2.25, dynamic token length: 906 0%| | 0/3 [00:00<?, ?it/s]
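The "dynamic ViT batch size" lines appear to report the total number of image tiles in the per-device batch, i.e. images-per-sample times the per-device batch size (4 for training micro-batches, 8 for evaluation). This reading is inferred from the logged numbers only, not from the InternVL source:

# inferred relation: vit_batch == images_per_sample * per_device_batch
assert 3.0 * 4 == 12    # training micro-batch of 4: "dynamic ViT batch size: 12, images per sample: 3.0"
assert 2.25 * 8 == 18   # eval batch of 8:           "dynamic ViT batch size: 18, images per sample: 2.25"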
dynamic ViT batch size: 22, images per sample: 2.75, dynamic token length: 936 100%|██████████| 3/3 [00:02<00:00, 1.01s/it] {'eval_loss': 1.2306509017944336, 'eval_runtime': 21.666, 'eval_samples_per_second': 2.123, 'eval_steps_per_second': 0.138, 'epoch': 1.0} 20%|██ | 26/130 [01:54<05:16, 3.05s/it] 100%|██████████| 3/3 [00:02<00:00, 1.01s/it]
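The evaluation summary above is internally consistent: 46 examples at a per-device batch of 8 on 2 GPUs is 3 evaluation steps (the 3/3 progress bar), and the throughput figures are just counts divided by the 21.666 s runtime. A quick illustrative check:

import math

num_eval_examples = 46
per_device_eval_batch = 8
world_size = 2
eval_runtime_s = 21.666

eval_steps = math.ceil(num_eval_examples / (per_device_eval_batch * world_size))
print(eval_steps)                                     # 3     -> the "3/3" progress bar
print(round(num_eval_examples / eval_runtime_s, 3))   # 2.123 -> eval_samples_per_second
print(round(eval_steps / eval_runtime_s, 3))          # 0.138 -> eval_steps_per_second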
[2024-08-28 13:30:54,821] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output): dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 936 [2024-08-28 13:30:59,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.33 | bwd_microstep: 842.07 | bwd_inner_microstep: 841.85 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 963 [2024-08-28 13:31:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:31:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.35 | bwd_microstep: 985.56 | bwd_inner_microstep: 873.64 | bwd_allreduce_microstep: 111.81 | step_microstep: 13.55 [2024-08-28 13:31:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 963.66 | bwd: 1827.66 | bwd_inner: 1715.57 | bwd_allreduce: 111.88 | step: 13.68 21%|██ | 27/130 [02:14<25:13, 14.69s/it] {'loss': 1.2631, 'learning_rate': 3.680051846301543e-05, 'epoch': 1.04} 21%|██ | 27/130 [02:14<25:13, 14.69s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 930 [2024-08-28 13:31:01,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.34 | bwd_microstep: 839.95 | bwd_inner_microstep: 839.74 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 959 [2024-08-28 13:31:03,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.61 | optimizer_step: 0.57 [2024-08-28 13:31:03,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.84 | bwd_microstep: 895.42 | bwd_inner_microstep: 873.94 | bwd_allreduce_microstep: 21.43 | step_microstep: 12.97 [2024-08-28 13:31:03,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 927.16 | bwd: 1735.40 | bwd_inner: 1713.72 | bwd_allreduce: 21.52 | step: 13.09 22%|██▏ | 28/130 [02:17<18:52, 11.10s/it] {'loss': 1.2049, 'learning_rate': 3.65247754863199e-05, 'epoch': 1.08} 22%|██▏ | 28/130 [02:17<18:52, 11.10s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 940 [2024-08-28 13:31:04,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 462.05 | bwd_microstep: 857.27 | bwd_inner_microstep: 856.99 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.15 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 927 [2024-08-28 13:31:05,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.60 | optimizer_step: 0.58 [2024-08-28 13:31:05,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.62 | bwd_microstep: 854.30 | bwd_inner_microstep: 847.99 | bwd_allreduce_microstep: 6.18 | step_microstep: 9.84 [2024-08-28 13:31:05,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 921.65 | bwd: 1711.60 | bwd_inner: 1705.08 | bwd_allreduce: 6.29 | step: 10.00 22%|██▏ | 29/130 [02:19<14:26, 8.57s/it] {'loss': 1.2327, 'learning_rate': 3.623876011431714e-05, 'epoch': 1.12} 22%|██▏ | 29/130 [02:19<14:26, 8.57s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 946 [2024-08-28 13:31:07,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.42 | bwd_microstep: 845.44 | bwd_inner_microstep: 845.34 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 936 
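With gradient_accumulation_steps = 2 there are two microsteps per optimizer step, and the aggregated "fwd"/"bwd" timings in the log are, to within a few hundredths of a millisecond, the sums of the two preceding microstep timings. For step 27 above (values copied from the log; illustrative check only):

fwd_microsteps = [490.33, 473.35]   # ms, the two fwd_microstep entries for step 27
bwd_microsteps = [842.07, 985.56]   # ms, the two bwd_microstep entries for step 27

print(sum(fwd_microsteps))   # ~963.68  vs logged "fwd: 963.66"
print(sum(bwd_microsteps))   # ~1827.63 vs logged "bwd: 1827.66"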
[2024-08-28 13:31:08,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 1.12 | optimizer_step: 0.64 [2024-08-28 13:31:08,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.55 | bwd_microstep: 851.32 | bwd_inner_microstep: 844.25 | bwd_allreduce_microstep: 6.94 | step_microstep: 12.07 [2024-08-28 13:31:08,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 900.95 | bwd: 1696.78 | bwd_inner: 1689.63 | bwd_allreduce: 6.97 | step: 12.15 23%|██▎ | 30/130 [02:22<11:19, 6.80s/it] {'loss': 1.0479, 'learning_rate': 3.5942650144458454e-05, 'epoch': 1.15} 23%|██▎ | 30/130 [02:22<11:19, 6.80s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 916 [2024-08-28 13:31:09,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.62 | bwd_microstep: 854.77 | bwd_inner_microstep: 854.51 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.15 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 918 [2024-08-28 13:31:11,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.62 | optimizer_step: 0.62 [2024-08-28 13:31:11,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.15 | bwd_microstep: 841.50 | bwd_inner_microstep: 834.98 | bwd_allreduce_microstep: 6.43 | step_microstep: 10.13 [2024-08-28 13:31:11,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 915.75 | bwd: 1696.31 | bwd_inner: 1689.53 | bwd_allreduce: 6.56 | step: 10.29 24%|██▍ | 31/130 [02:25<09:10, 5.56s/it] {'loss': 1.0442, 'learning_rate': 3.56366296493606e-05, 'epoch': 1.19} 24%|██▍ | 31/130 [02:25<09:10, 5.56s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 920 [2024-08-28 13:31:12,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.15 | bwd_microstep: 833.53 | bwd_inner_microstep: 833.31 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 909 [2024-08-28 13:31:14,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.62 | optimizer_step: 0.60 [2024-08-28 13:31:14,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.83 | bwd_microstep: 985.54 | bwd_inner_microstep: 840.96 | bwd_allreduce_microstep: 144.53 | step_microstep: 13.79 [2024-08-28 13:31:14,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 902.96 | bwd: 1819.11 | bwd_inner: 1674.30 | bwd_allreduce: 144.62 | step: 13.92 25%|██▍ | 32/130 [02:27<07:42, 4.72s/it] {'loss': 1.0004, 'learning_rate': 3.532088886237956e-05, 'epoch': 1.23} 25%|██▍ | 32/130 [02:27<07:42, 4.72s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 953 [2024-08-28 13:31:15,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.22 | bwd_microstep: 858.86 | bwd_inner_microstep: 858.62 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 951 [2024-08-28 13:31:16,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.64 | optimizer_step: 0.57 [2024-08-28 13:31:16,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.23 | bwd_microstep: 879.92 | bwd_inner_microstep: 873.53 | bwd_allreduce_microstep: 6.31 | step_microstep: 9.92 [2024-08-28 13:31:16,795] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 943.43 | bwd: 1738.82 | bwd_inner: 1732.19 | bwd_allreduce: 6.39 | step: 10.07 25%|██▌ | 33/130 [02:30<06:40, 4.13s/it] {'loss': 1.0744, 'learning_rate': 3.499562405935469e-05, 'epoch': 1.27} 25%|██▌ | 33/130 [02:30<06:40, 4.13s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 946 [2024-08-28 13:31:18,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.38 | bwd_microstep: 849.49 | bwd_inner_microstep: 849.34 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 963 [2024-08-28 13:31:19,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:31:19,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.08 | bwd_microstep: 880.81 | bwd_inner_microstep: 873.85 | bwd_allreduce_microstep: 6.83 | step_microstep: 13.47 [2024-08-28 13:31:19,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 928.44 | bwd: 1730.32 | bwd_inner: 1723.27 | bwd_allreduce: 6.87 | step: 13.56 26%|██▌ | 34/130 [02:33<05:55, 3.70s/it] {'loss': 0.8552, 'learning_rate': 3.4661037436596526e-05, 'epoch': 1.31} 26%|██▌ | 34/130 [02:33<05:55, 3.70s/it]dynamic ViT batch size: 6, images per sample: 1.5, dynamic token length: 932 [2024-08-28 13:31:20,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 437.92 | bwd_microstep: 831.29 | bwd_inner_microstep: 831.04 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.19 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 921 [2024-08-28 13:31:22,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.63 | optimizer_step: 0.63 [2024-08-28 13:31:22,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.08 | bwd_microstep: 886.23 | bwd_inner_microstep: 858.51 | bwd_allreduce_microstep: 27.67 | step_microstep: 16.46 [2024-08-28 13:31:22,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 910.96 | bwd: 1717.55 | bwd_inner: 1689.58 | bwd_allreduce: 27.80 | step: 16.63 27%|██▋ | 35/130 [02:36<05:23, 3.40s/it] {'loss': 1.0061, 'learning_rate': 3.431733698519437e-05, 'epoch': 1.35} 27%|██▋ | 35/130 [02:36<05:23, 3.40s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 931 [2024-08-28 13:31:23,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 449.29 | bwd_microstep: 840.90 | bwd_inner_microstep: 840.62 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.16 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 931 [2024-08-28 13:31:24,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.63 | optimizer_step: 0.60 [2024-08-28 13:31:24,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.69 | bwd_microstep: 911.08 | bwd_inner_microstep: 869.17 | bwd_allreduce_microstep: 41.80 | step_microstep: 13.42 [2024-08-28 13:31:24,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 925.96 | bwd: 1752.01 | bwd_inner: 1709.85 | bwd_allreduce: 41.88 | step: 13.58 28%|██▊ | 36/130 [02:38<05:00, 3.20s/it] {'loss': 0.8791, 'learning_rate': 3.396473636172146e-05, 'epoch': 1.38} 28%|██▊ | 36/130 [02:38<05:00, 3.20s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 966 [2024-08-28 13:31:26,341] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 487.74 | bwd_microstep: 888.58 | bwd_inner_microstep: 888.37 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 930 [2024-08-28 13:31:27,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.64 | optimizer_step: 0.60 [2024-08-28 13:31:27,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 451.35 | bwd_microstep: 849.39 | bwd_inner_microstep: 842.24 | bwd_allreduce_microstep: 7.00 | step_microstep: 10.22 [2024-08-28 13:31:27,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 939.06 | bwd: 1738.01 | bwd_inner: 1730.71 | bwd_allreduce: 7.03 | step: 10.36 28%|██▊ | 37/130 [02:41<04:44, 3.06s/it] {'loss': 0.8711, 'learning_rate': 3.360345475541839e-05, 'epoch': 1.42} 28%|██▊ | 37/130 [02:41<04:44, 3.06s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:31:29,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.77 | bwd_microstep: 869.79 | bwd_inner_microstep: 869.62 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 908 [2024-08-28 13:31:30,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.73 | optimizer_step: 0.69 [2024-08-28 13:31:30,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 458.35 | bwd_microstep: 925.62 | bwd_inner_microstep: 841.86 | bwd_allreduce_microstep: 83.62 | step_microstep: 13.37 [2024-08-28 13:31:30,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 936.10 | bwd: 1795.44 | bwd_inner: 1711.52 | bwd_allreduce: 83.69 | step: 13.49 29%|██▉ | 38/130 [02:44<04:33, 2.98s/it] {'loss': 1.1293, 'learning_rate': 3.323371675193719e-05, 'epoch': 1.46} 29%|██▉ | 38/130 [02:44<04:33, 2.98s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 960 [2024-08-28 13:31:31,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 470.29 | bwd_microstep: 863.60 | bwd_inner_microstep: 863.32 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.15 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 968 [2024-08-28 13:31:33,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.65 | optimizer_step: 0.69 [2024-08-28 13:31:33,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.03 | bwd_microstep: 874.95 | bwd_inner_microstep: 868.01 | bwd_allreduce_microstep: 6.78 | step_microstep: 10.58 [2024-08-28 13:31:33,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 934.30 | bwd: 1738.58 | bwd_inner: 1731.46 | bwd_allreduce: 6.88 | step: 10.74 30%|███ | 39/130 [02:47<04:24, 2.90s/it] {'loss': 0.7738, 'learning_rate': 3.285575219373079e-05, 'epoch': 1.5} 30%|███ | 39/130 [02:47<04:24, 2.90s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 951 [2024-08-28 13:31:34,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.90 | bwd_microstep: 847.72 | bwd_inner_microstep: 847.48 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.19 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 946 [2024-08-28 13:31:35,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.61 | optimizer_step: 
0.57 [2024-08-28 13:31:35,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.63 | bwd_microstep: 866.75 | bwd_inner_microstep: 859.97 | bwd_allreduce_microstep: 6.66 | step_microstep: 9.97 [2024-08-28 13:31:35,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 921.51 | bwd: 1714.50 | bwd_inner: 1707.53 | bwd_allreduce: 6.72 | step: 10.15 31%|███ | 40/130 [02:49<04:15, 2.84s/it] {'loss': 0.7033, 'learning_rate': 3.246979603717467e-05, 'epoch': 1.54} 31%|███ | 40/130 [02:49<04:15, 2.84s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 973 [2024-08-28 13:31:37,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.46 | bwd_microstep: 878.08 | bwd_inner_microstep: 877.80 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.16 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 916 [2024-08-28 13:31:38,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.63 | optimizer_step: 0.68 [2024-08-28 13:31:38,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.06 | bwd_microstep: 851.88 | bwd_inner_microstep: 845.13 | bwd_allreduce_microstep: 6.61 | step_microstep: 10.51 [2024-08-28 13:31:38,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 939.49 | bwd: 1730.01 | bwd_inner: 1723.06 | bwd_allreduce: 6.70 | step: 10.68 32%|███▏ | 41/130 [02:52<04:09, 2.80s/it] {'loss': 0.8763, 'learning_rate': 3.207608820650955e-05, 'epoch': 1.58} 32%|███▏ | 41/130 [02:52<04:09, 2.80s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 937 [2024-08-28 13:31:39,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.05 | bwd_microstep: 842.40 | bwd_inner_microstep: 842.14 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 917 [2024-08-28 13:31:41,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.65 | optimizer_step: 0.59 [2024-08-28 13:31:41,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.65 | bwd_microstep: 895.12 | bwd_inner_microstep: 835.07 | bwd_allreduce_microstep: 59.94 | step_microstep: 13.07 [2024-08-28 13:31:41,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 899.67 | bwd: 1737.55 | bwd_inner: 1677.32 | bwd_allreduce: 60.01 | step: 13.23 32%|███▏ | 42/130 [02:55<04:03, 2.77s/it] {'loss': 0.7544, 'learning_rate': 3.1674873444695804e-05, 'epoch': 1.62} 32%|███▏ | 42/130 [02:55<04:03, 2.77s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 946 [2024-08-28 13:31:42,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.37 | bwd_microstep: 872.90 | bwd_inner_microstep: 872.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 914 [2024-08-28 13:31:44,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:31:44,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.12 | bwd_microstep: 858.09 | bwd_inner_microstep: 844.18 | bwd_allreduce_microstep: 13.78 | step_microstep: 12.76 [2024-08-28 13:31:44,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 940.46 | bwd: 1731.01 | bwd_inner: 1716.94 | bwd_allreduce: 13.85 | step: 12.89 33%|███▎ | 43/130 
[02:57<03:59, 2.76s/it] {'loss': 0.7928, 'learning_rate': 3.126640116127244e-05, 'epoch': 1.65} 33%|███▎ | 43/130 [02:57<03:59, 2.76s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 961 [2024-08-28 13:31:45,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.76 | bwd_microstep: 890.43 | bwd_inner_microstep: 890.17 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 932 [2024-08-28 13:31:46,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.62 | optimizer_step: 0.62 [2024-08-28 13:31:46,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.92 | bwd_microstep: 876.02 | bwd_inner_microstep: 869.44 | bwd_allreduce_microstep: 6.45 | step_microstep: 10.04 [2024-08-28 13:31:46,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 966.66 | bwd: 1766.48 | bwd_inner: 1759.71 | bwd_allreduce: 6.55 | step: 10.18 34%|███▍ | 44/130 [03:00<03:57, 2.76s/it] {'loss': 0.8081, 'learning_rate': 3.0850925277315193e-05, 'epoch': 1.69} 34%|███▍ | 44/130 [03:00<03:57, 2.76s/it]dynamic ViT batch size: 6, images per sample: 1.5, dynamic token length: 926 [2024-08-28 13:31:48,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 434.59 | bwd_microstep: 826.50 | bwd_inner_microstep: 826.37 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 956 [2024-08-28 13:31:49,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:31:49,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.28 | bwd_microstep: 886.97 | bwd_inner_microstep: 877.96 | bwd_allreduce_microstep: 8.94 | step_microstep: 12.90 [2024-08-28 13:31:49,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 915.85 | bwd: 1713.49 | bwd_inner: 1704.36 | bwd_allreduce: 8.98 | step: 12.98 35%|███▍ | 45/130 [03:03<03:52, 2.74s/it] {'loss': 0.7112, 'learning_rate': 3.0428704067589963e-05, 'epoch': 1.73} 35%|███▍ | 45/130 [03:03<03:52, 2.74s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 956 [2024-08-28 13:31:50,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.31 | bwd_microstep: 862.62 | bwd_inner_microstep: 862.55 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 913 [2024-08-28 13:31:52,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.56 [2024-08-28 13:31:52,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.52 | bwd_microstep: 865.37 | bwd_inner_microstep: 859.30 | bwd_allreduce_microstep: 6.01 | step_microstep: 9.76 [2024-08-28 13:31:52,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 941.82 | bwd: 1728.00 | bwd_inner: 1721.85 | bwd_allreduce: 6.06 | step: 9.83 35%|███▌ | 46/130 [03:06<03:49, 2.73s/it] {'loss': 0.7927, 'learning_rate': 3.0000000000000004e-05, 'epoch': 1.77} 35%|███▌ | 46/130 [03:06<03:49, 2.73s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 943 [2024-08-28 13:31:53,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.77 | bwd_microstep: 858.53 | bwd_inner_microstep: 858.46 | bwd_allreduce_microstep: 0.02 
| step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 941 [2024-08-28 13:31:54,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.60 | optimizer_step: 0.55 [2024-08-28 13:31:54,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.89 | bwd_microstep: 865.79 | bwd_inner_microstep: 859.30 | bwd_allreduce_microstep: 6.37 | step_microstep: 9.72 [2024-08-28 13:31:54,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 933.64 | bwd: 1724.34 | bwd_inner: 1717.80 | bwd_allreduce: 6.40 | step: 9.82 36%|███▌ | 47/130 [03:08<03:46, 2.73s/it] {'loss': 0.6722, 'learning_rate': 2.956507957242637e-05, 'epoch': 1.81} 36%|███▌ | 47/130 [03:08<03:46, 2.73s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:31:56,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.68 | bwd_microstep: 852.09 | bwd_inner_microstep: 851.99 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.08 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 934 [2024-08-28 13:31:57,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.63 | optimizer_step: 0.69 [2024-08-28 13:31:57,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.89 | bwd_microstep: 878.11 | bwd_inner_microstep: 871.25 | bwd_allreduce_microstep: 6.70 | step_microstep: 10.40 [2024-08-28 13:31:57,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 942.54 | bwd: 1730.23 | bwd_inner: 1723.32 | bwd_allreduce: 6.74 | step: 10.48 37%|███▋ | 48/130 [03:11<03:43, 2.73s/it] {'loss': 0.7657, 'learning_rate': 2.9124213147063263e-05, 'epoch': 1.85} 37%|███▋ | 48/130 [03:11<03:43, 2.73s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 944 [2024-08-28 13:31:59,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.42 | bwd_microstep: 872.72 | bwd_inner_microstep: 872.48 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 913 [2024-08-28 13:32:00,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.59 | optimizer_step: 0.58 [2024-08-28 13:32:00,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.82 | bwd_microstep: 864.17 | bwd_inner_microstep: 857.64 | bwd_allreduce_microstep: 6.40 | step_microstep: 9.92 [2024-08-28 13:32:00,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 952.21 | bwd: 1736.91 | bwd_inner: 1730.19 | bwd_allreduce: 6.50 | step: 10.06 38%|███▊ | 49/130 [03:14<03:41, 2.73s/it] {'loss': 0.7748, 'learning_rate': 2.8677674782351164e-05, 'epoch': 1.88} 38%|███▊ | 49/130 [03:14<03:41, 2.73s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 924 [2024-08-28 13:32:01,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.86 | bwd_microstep: 837.22 | bwd_inner_microstep: 836.99 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 931 [2024-08-28 13:32:03,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.59 [2024-08-28 13:32:03,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.09 | bwd_microstep: 875.91 | 
bwd_inner_microstep: 869.19 | bwd_allreduce_microstep: 6.64 | step_microstep: 10.08 [2024-08-28 13:32:03,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 926.93 | bwd: 1713.17 | bwd_inner: 1706.22 | bwd_allreduce: 6.71 | step: 10.22 38%|███▊ | 50/130 [03:16<03:37, 2.72s/it] {'loss': 0.7922, 'learning_rate': 2.8225742062612236e-05, 'epoch': 1.92} 38%|███▊ | 50/130 [03:16<03:37, 2.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 956 [2024-08-28 13:32:04,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.83 | bwd_microstep: 876.85 | bwd_inner_microstep: 876.66 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:32:05,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:32:05,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.23 | bwd_microstep: 879.10 | bwd_inner_microstep: 872.52 | bwd_allreduce_microstep: 6.52 | step_microstep: 9.91 [2024-08-28 13:32:05,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 961.04 | bwd: 1755.97 | bwd_inner: 1749.19 | bwd_allreduce: 6.58 | step: 10.05 39%|███▉ | 51/130 [03:19<03:35, 2.73s/it] {'loss': 0.8061, 'learning_rate': 2.7768695925493897e-05, 'epoch': 1.96} 39%|███▉ | 51/130 [03:19<03:35, 2.73s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 943 [2024-08-28 13:32:07,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.12 | bwd_microstep: 872.22 | bwd_inner_microstep: 872.04 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12
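The learning_rate values logged above are consistent with the Hugging Face cosine-with-warmup schedule decaying from the configured 4e-05 over 130 steps, provided one assumes 4 warmup steps (roughly 3% of 130). The warmup count is inferred by fitting the logged values, not stated anywhere in this log, so treat this as a plausibility check rather than the training script's actual scheduler code:

import math

base_lr = 4e-05          # optimizer lr from the config above
total_steps = 130
warmup_steps = 4         # assumption: ~3% warmup, inferred from the logged values

def cosine_lr(step):
    # transformers-style cosine schedule with linear warmup (num_cycles = 0.5)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step, logged in [(26, 3.7065817632643115e-05),
                     (46, 3.0000000000000004e-05),
                     (51, 2.7768695925493897e-05)]:
    assert abs(cosine_lr(step) - logged) < 1e-08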
dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 922
[2024-08-28 13:32:09,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.59
[2024-08-28 13:32:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.66 | bwd_microstep: 852.86 | bwd_inner_microstep: 846.27 | bwd_allreduce_microstep: 6.47 | step_microstep: 9.99
[2024-08-28 13:32:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 940.76 | bwd: 1725.11 | bwd_inner: 1718.36 | bwd_allreduce: 6.54 | step: 10.11
 40%|████      | 52/130 [03:23<03:57, 3.05s/it] {'loss': 0.7763, 'learning_rate': 2.7306820487327906e-05, 'epoch': 2.0}
[INFO|trainer.py:3242] 2024-08-28 13:32:09,634 >> ***** Running Evaluation *****
[INFO|trainer.py:3244] 2024-08-28 13:32:09,634 >> Num examples = 46
[INFO|trainer.py:3247] 2024-08-28 13:32:09,634 >> Batch size = 8
[2024-08-28 13:32:11,366] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-28 13:32:11,368] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
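The `FutureWarning` above is raised from DeepSpeed's own `runtime/zero/linear.py`, so for this run it is harmless noise until DeepSpeed updates that file. For user code hitting the same deprecation on torch 2.4, the migration the warning asks for looks roughly like this (a minimal sketch with a made-up autograd function, not the DeepSpeed code that emits the message):

```python
import torch

class ScaledMatmul(torch.autograd.Function):
    """Illustration only: shows the new torch.amp decorator spelling."""

    # was: @torch.cuda.amp.custom_fwd   (the deprecated form flagged above)
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a @ b

    # was: @torch.cuda.amp.custom_bwd
    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        return grad_out @ b.t(), a.t() @ grad_out
```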
dynamic ViT batch size: 18, images per sample: 2.25, dynamic token length: 906
  0%|          | 0/3 [00:00<?, ?it/s]
dynamic ViT batch size: 22, images per sample: 2.75, dynamic token length: 936
100%|██████████| 3/3 [00:03<00:00, 1.34s/it]
{'eval_loss': 0.6676424741744995, 'eval_runtime': 21.6655, 'eval_samples_per_second': 2.123, 'eval_steps_per_second': 0.138, 'epoch': 2.0}
 40%|████      | 52/130 [03:45<03:57, 3.05s/it]
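The evaluation numbers above are internally consistent: 46 examples at a per-device eval batch size of 8 finish in 3 steps if two processes share the work (the world size of 2 is an assumption, it is not printed here), and the reported throughput is just examples divided by runtime:

```python
import math

# Numbers copied from the evaluation block above; world_size = 2 is assumed.
num_examples, per_device_bs, world_size = 46, 8, 2
eval_runtime = 21.6655  # seconds
steps = math.ceil(num_examples / (per_device_bs * world_size))

print(steps)                                   # 3, matching the "3/3" bar
print(round(num_examples / eval_runtime, 3))   # 2.123 -> eval_samples_per_second
print(round(steps / eval_runtime, 3))          # 0.138 -> eval_steps_per_second
```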
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 960
[2024-08-28 13:32:49,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 538.08 | bwd_microstep: 886.59 | bwd_inner_microstep: 886.35 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.12
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 936
[2024-08-28 13:32:51,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.62 | optimizer_step: 0.58
[2024-08-28 13:32:51,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 474.18 | bwd_microstep: 1611.71 | bwd_inner_microstep: 866.28 | bwd_allreduce_microstep: 745.38 | step_microstep: 13.08
[2024-08-28 13:32:51,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1012.23 | bwd: 2498.32 | bwd_inner: 1752.67 | bwd_allreduce: 745.50 | step: 13.21
 41%|████      | 53/130 [04:05<18:51, 14.70s/it] {'loss': 0.7045, 'learning_rate': 2.684040286651338e-05, 'epoch': 2.04}
dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 910
[2024-08-28 13:32:52,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 449.96 | bwd_microstep: 829.76 | bwd_inner_microstep: 829.53 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 927
[2024-08-28 13:32:54,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.62 | optimizer_step: 0.66
[2024-08-28 13:32:54,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.77 | bwd_microstep: 867.07 | bwd_inner_microstep: 860.46 | bwd_allreduce_microstep: 6.47 |
step_microstep: 10.32 [2024-08-28 13:32:54,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 927.71 | bwd: 1696.86 | bwd_inner: 1690.09 | bwd_allreduce: 6.53 | step: 10.49 42%|████▏ | 54/130 [04:08<14:03, 11.09s/it] {'loss': 0.7259, 'learning_rate': 2.6369733005033693e-05, 'epoch': 2.08} 42%|████▏ | 54/130 [04:08<14:03, 11.09s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 939 [2024-08-28 13:32:55,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.35 | bwd_microstep: 852.77 | bwd_inner_microstep: 852.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 933 [2024-08-28 13:32:56,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:32:56,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 474.41 | bwd_microstep: 872.23 | bwd_inner_microstep: 866.03 | bwd_allreduce_microstep: 6.15 | step_microstep: 9.78 [2024-08-28 13:32:56,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 937.74 | bwd: 1725.02 | bwd_inner: 1718.75 | bwd_allreduce: 6.18 | step: 9.86 42%|████▏ | 55/130 [04:10<10:43, 8.58s/it] {'loss': 0.684, 'learning_rate': 2.589510348821809e-05, 'epoch': 2.12} 42%|████▏ | 55/130 [04:10<10:43, 8.58s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 936 [2024-08-28 13:32:58,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 467.08 | bwd_microstep: 851.98 | bwd_inner_microstep: 851.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 941 [2024-08-28 13:32:59,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.60 | optimizer_step: 0.59 [2024-08-28 13:32:59,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 468.75 | bwd_microstep: 862.23 | bwd_inner_microstep: 855.65 | bwd_allreduce_microstep: 6.41 | step_microstep: 9.86 [2024-08-28 13:32:59,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 935.80 | bwd: 1714.24 | bwd_inner: 1707.54 | bwd_allreduce: 6.48 | step: 10.00 43%|████▎ | 56/130 [04:13<08:24, 6.82s/it] {'loss': 0.6923, 'learning_rate': 2.5416809362860107e-05, 'epoch': 2.15} 43%|████▎ | 56/130 [04:13<08:24, 6.82s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 944 [2024-08-28 13:33:00,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.81 | bwd_microstep: 854.83 | bwd_inner_microstep: 854.62 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 925 [2024-08-28 13:33:02,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:33:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.54 | bwd_microstep: 912.34 | bwd_inner_microstep: 836.97 | bwd_allreduce_microstep: 75.26 | step_microstep: 13.13 [2024-08-28 13:33:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 911.32 | bwd: 1767.20 | bwd_inner: 1691.66 | bwd_allreduce: 75.34 | step: 13.26 44%|████▍ | 57/130 [04:16<06:48, 5.59s/it] {'loss': 0.6401, 'learning_rate': 2.493514795380587e-05, 'epoch': 2.19} 44%|████▍ | 57/130 [04:16<06:48, 5.59s/it]dynamic ViT batch size: 10, images per sample: 2.5, 
dynamic token length: 946 [2024-08-28 13:33:03,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.33 | bwd_microstep: 855.66 | bwd_inner_microstep: 855.60 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 923 [2024-08-28 13:33:05,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.62 | optimizer_step: 0.58 [2024-08-28 13:33:05,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 462.56 | bwd_microstep: 852.90 | bwd_inner_microstep: 846.26 | bwd_allreduce_microstep: 6.55 | step_microstep: 9.95 [2024-08-28 13:33:05,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 933.87 | bwd: 1708.58 | bwd_inner: 1701.87 | bwd_allreduce: 6.57 | step: 10.05 45%|████▍ | 58/130 [04:18<05:39, 4.72s/it] {'loss': 0.7307, 'learning_rate': 2.445041867912629e-05, 'epoch': 2.23} 45%|████▍ | 58/130 [04:18<05:39, 4.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 963 [2024-08-28 13:33:06,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 487.82 | bwd_microstep: 887.58 | bwd_inner_microstep: 887.54 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.09 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 915 [2024-08-28 13:33:07,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.61 | optimizer_step: 0.56 [2024-08-28 13:33:07,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.75 | bwd_microstep: 864.14 | bwd_inner_microstep: 857.68 | bwd_allreduce_microstep: 6.40 | step_microstep: 9.70 [2024-08-28 13:33:07,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 957.55 | bwd: 1751.72 | bwd_inner: 1745.22 | bwd_allreduce: 6.40 | step: 9.80 45%|████▌ | 59/130 [04:21<04:53, 4.13s/it] {'loss': 0.6738, 'learning_rate': 2.3962922863987956e-05, 'epoch': 2.27} 45%|████▌ | 59/130 [04:21<04:53, 4.13s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 956 [2024-08-28 13:33:09,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.65 | bwd_microstep: 848.94 | bwd_inner_microstep: 848.73 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 927 [2024-08-28 13:33:10,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.59 | optimizer_step: 0.57 [2024-08-28 13:33:10,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.16 | bwd_microstep: 855.01 | bwd_inner_microstep: 848.31 | bwd_allreduce_microstep: 6.58 | step_microstep: 9.84 [2024-08-28 13:33:10,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 924.78 | bwd: 1703.98 | bwd_inner: 1697.12 | bwd_allreduce: 6.65 | step: 9.95 46%|████▌ | 60/130 [04:24<04:18, 3.70s/it] {'loss': 0.6694, 'learning_rate': 2.3472963553338614e-05, 'epoch': 2.31} 46%|████▌ | 60/130 [04:24<04:18, 3.70s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 960 [2024-08-28 13:33:11,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.20 | bwd_microstep: 864.53 | bwd_inner_microstep: 864.31 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.07 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 963 [2024-08-28 13:33:13,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.51 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:33:13,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 487.69 | bwd_microstep: 895.42 | bwd_inner_microstep: 888.83 | bwd_allreduce_microstep: 6.53 | step_microstep: 13.32 [2024-08-28 13:33:13,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 956.87 | bwd: 1759.98 | bwd_inner: 1753.18 | bwd_allreduce: 6.60 | step: 13.39 47%|████▋ | 61/130 [04:27<03:55, 3.42s/it] {'loss': 0.6664, 'learning_rate': 2.2980845323523487e-05, 'epoch': 2.35} 47%|████▋ | 61/130 [04:27<03:55, 3.42s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 932 [2024-08-28 13:33:14,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 467.47 | bwd_microstep: 853.07 | bwd_inner_microstep: 852.85 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 938 [2024-08-28 13:33:15,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.59 [2024-08-28 13:33:15,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.50 | bwd_microstep: 876.99 | bwd_inner_microstep: 870.24 | bwd_allreduce_microstep: 6.68 | step_microstep: 9.81 [2024-08-28 13:33:15,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 949.95 | bwd: 1730.09 | bwd_inner: 1723.12 | bwd_allreduce: 6.74 | step: 9.97 48%|████▊ | 62/130 [04:29<03:38, 3.21s/it] {'loss': 0.6477, 'learning_rate': 2.2486874092949708e-05, 'epoch': 2.38} 48%|████▊ | 62/130 [04:29<03:38, 3.21s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 944 [2024-08-28 13:33:17,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.62 | bwd_microstep: 844.52 | bwd_inner_microstep: 844.42 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 dynamic ViT batch size: 6, images per sample: 1.5, dynamic token length: 898 [2024-08-28 13:33:18,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.59 | optimizer_step: 0.56 [2024-08-28 13:33:18,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.40 | bwd_microstep: 925.33 | bwd_inner_microstep: 817.68 | bwd_allreduce_microstep: 107.51 | step_microstep: 13.09 [2024-08-28 13:33:18,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 882.00 | bwd: 1769.88 | bwd_inner: 1662.17 | bwd_allreduce: 107.54 | step: 13.17 48%|████▊ | 63/130 [04:32<03:25, 3.06s/it] {'loss': 0.6324, 'learning_rate': 2.1991356931916335e-05, 'epoch': 2.42} 48%|████▊ | 63/130 [04:32<03:25, 3.06s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 954 [2024-08-28 13:33:19,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 458.96 | bwd_microstep: 848.88 | bwd_inner_microstep: 848.75 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.09 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 917 [2024-08-28 13:33:21,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.58 | optimizer_step: 0.58 [2024-08-28 13:33:21,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.19 | bwd_microstep: 872.52 | bwd_inner_microstep: 845.48 | bwd_allreduce_microstep: 26.92 | step_microstep: 13.24 [2024-08-28 13:33:21,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 919.13 | bwd: 
1721.42 | bwd_inner: 1694.27 | bwd_allreduce: 26.97 | step: 13.33 49%|████▉ | 64/130 [04:35<03:14, 2.95s/it] {'loss': 0.6167, 'learning_rate': 2.149460187172849e-05, 'epoch': 2.46} 49%|████▉ | 64/130 [04:35<03:14, 2.95s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 933 [2024-08-28 13:33:22,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.66 | bwd_microstep: 840.79 | bwd_inner_microstep: 840.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 937 [2024-08-28 13:33:24,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.58 | optimizer_step: 0.57 [2024-08-28 13:33:24,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.59 | bwd_microstep: 862.71 | bwd_inner_microstep: 856.87 | bwd_allreduce_microstep: 5.73 | step_microstep: 9.63 [2024-08-28 13:33:24,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 923.23 | bwd: 1703.53 | bwd_inner: 1697.52 | bwd_allreduce: 5.80 | step: 9.77 50%|█████ | 65/130 [04:37<03:06, 2.87s/it] {'loss': 0.6035, 'learning_rate': 2.0996917713213944e-05, 'epoch': 2.5} 50%|█████ | 65/130 [04:37<03:06, 2.87s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 939 [2024-08-28 13:33:25,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.72 | bwd_microstep: 842.70 | bwd_inner_microstep: 842.47 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 916 [2024-08-28 13:33:26,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.58 | optimizer_step: 0.59 [2024-08-28 13:33:26,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.81 | bwd_microstep: 928.65 | bwd_inner_microstep: 833.52 | bwd_allreduce_microstep: 95.02 | step_microstep: 13.13 [2024-08-28 13:33:26,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 899.50 | bwd: 1771.38 | bwd_inner: 1676.06 | bwd_allreduce: 95.10 | step: 13.26 51%|█████ | 66/130 [04:40<03:00, 2.82s/it] {'loss': 0.653, 'learning_rate': 2.0498613834761462e-05, 'epoch': 2.54} 51%|█████ | 66/130 [04:40<03:00, 2.82s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 922 [2024-08-28 13:33:28,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.64 | bwd_microstep: 861.25 | bwd_inner_microstep: 860.97 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.15 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 925 [2024-08-28 13:33:29,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.60 | optimizer_step: 0.59 [2024-08-28 13:33:29,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.40 | bwd_microstep: 843.48 | bwd_inner_microstep: 837.71 | bwd_allreduce_microstep: 5.66 | step_microstep: 9.62 [2024-08-28 13:33:29,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 926.02 | bwd: 1704.76 | bwd_inner: 1698.77 | bwd_allreduce: 5.78 | step: 9.78 52%|█████▏ | 67/130 [04:43<02:55, 2.78s/it] {'loss': 0.6213, 'learning_rate': 2e-05, 'epoch': 2.58} 52%|█████▏ | 67/130 [04:43<02:55, 2.78s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:33:30,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.73 | 
bwd_microstep: 870.44 | bwd_inner_microstep: 870.20 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 928 [2024-08-28 13:33:32,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.57 | optimizer_step: 0.57 [2024-08-28 13:33:32,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 474.51 | bwd_microstep: 870.57 | bwd_inner_microstep: 864.70 | bwd_allreduce_microstep: 5.82 | step_microstep: 9.46 [2024-08-28 13:33:32,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 953.22 | bwd: 1741.04 | bwd_inner: 1734.93 | bwd_allreduce: 5.89 | step: 9.60 52%|█████▏ | 68/130 [04:46<02:51, 2.77s/it] {'loss': 0.5961, 'learning_rate': 1.9501386165238548e-05, 'epoch': 2.62} 52%|█████▏ | 68/130 [04:46<02:51, 2.77s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 913 [2024-08-28 13:33:33,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.68 | bwd_microstep: 843.13 | bwd_inner_microstep: 842.91 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.16 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:33:34,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.58 | optimizer_step: 0.57 [2024-08-28 13:33:34,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.65 | bwd_microstep: 859.40 | bwd_inner_microstep: 853.56 | bwd_allreduce_microstep: 5.72 | step_microstep: 9.63 [2024-08-28 13:33:34,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 930.31 | bwd: 1702.56 | bwd_inner: 1696.55 | bwd_allreduce: 5.80 | step: 9.79 53%|█████▎ | 69/130 [04:48<02:47, 2.74s/it] {'loss': 0.6726, 'learning_rate': 1.9003082286786056e-05, 'epoch': 2.65} 53%|█████▎ | 69/130 [04:48<02:47, 2.74s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 939 [2024-08-28 13:33:36,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.82 | bwd_microstep: 856.23 | bwd_inner_microstep: 856.02 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 918 [2024-08-28 13:33:37,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.57 | optimizer_step: 0.56 [2024-08-28 13:33:37,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.93 | bwd_microstep: 852.63 | bwd_inner_microstep: 846.91 | bwd_allreduce_microstep: 5.61 | step_microstep: 9.44 [2024-08-28 13:33:37,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 926.73 | bwd: 1708.88 | bwd_inner: 1703.00 | bwd_allreduce: 5.68 | step: 9.57 54%|█████▍ | 70/130 [04:51<02:43, 2.73s/it] {'loss': 0.6147, 'learning_rate': 1.8505398128271517e-05, 'epoch': 2.69} 54%|█████▍ | 70/130 [04:51<02:43, 2.73s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 935 [2024-08-28 13:33:38,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 458.35 | bwd_microstep: 842.33 | bwd_inner_microstep: 842.20 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 951 [2024-08-28 13:33:40,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.62 | optimizer_step: 0.57 [2024-08-28 13:33:40,276] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.55 | bwd_microstep: 915.13 | bwd_inner_microstep: 851.69 | bwd_allreduce_microstep: 63.30 | step_microstep: 13.13 [2024-08-28 13:33:40,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 912.88 | bwd: 1757.48 | bwd_inner: 1693.97 | bwd_allreduce: 63.34 | step: 13.23 55%|█████▍ | 71/130 [04:54<02:40, 2.73s/it] {'loss': 0.5905, 'learning_rate': 1.800864306808367e-05, 'epoch': 2.73} 55%|█████▍ | 71/130 [04:54<02:40, 2.73s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 978 [2024-08-28 13:33:41,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.80 | bwd_microstep: 882.39 | bwd_inner_microstep: 882.18 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 946 [2024-08-28 13:33:43,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.61 | optimizer_step: 0.57 [2024-08-28 13:33:43,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.58 | bwd_microstep: 881.51 | bwd_inner_microstep: 874.87 | bwd_allreduce_microstep: 6.56 | step_microstep: 9.85 [2024-08-28 13:33:43,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 959.36 | bwd: 1763.92 | bwd_inner: 1757.08 | bwd_allreduce: 6.63 | step: 9.99 55%|█████▌ | 72/130 [04:56<02:38, 2.74s/it] {'loss': 0.671, 'learning_rate': 1.7513125907050302e-05, 'epoch': 2.77} 55%|█████▌ | 72/130 [04:56<02:38, 2.74s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 972 [2024-08-28 13:33:44,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.39 | bwd_microstep: 870.54 | bwd_inner_microstep: 870.26 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.21 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 936 [2024-08-28 13:33:45,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.63 | optimizer_step: 0.63 [2024-08-28 13:33:45,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.35 | bwd_microstep: 878.15 | bwd_inner_microstep: 871.31 | bwd_allreduce_microstep: 6.72 | step_microstep: 10.17 [2024-08-28 13:33:45,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 951.71 | bwd: 1748.72 | bwd_inner: 1741.66 | bwd_allreduce: 6.84 | step: 10.39 56%|█████▌ | 73/130 [04:59<02:36, 2.75s/it] {'loss': 0.6069, 'learning_rate': 1.701915467647651e-05, 'epoch': 2.81} 56%|█████▌ | 73/130 [04:59<02:36, 2.75s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 951 [2024-08-28 13:33:47,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.73 | bwd_microstep: 875.50 | bwd_inner_microstep: 875.47 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.09 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 926 [2024-08-28 13:33:48,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:33:48,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.62 | bwd_microstep: 844.08 | bwd_inner_microstep: 837.85 | bwd_allreduce_microstep: 6.18 | step_microstep: 9.88 [2024-08-28 13:33:48,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 929.33 | bwd: 1719.59 | bwd_inner: 1713.32 | bwd_allreduce: 6.19 | step: 9.97 57%|█████▋ | 74/130 [05:02<02:32, 2.73s/it] {'loss': 
0.6503, 'learning_rate': 1.6527036446661396e-05, 'epoch': 2.85} 57%|█████▋ | 74/130 [05:02<02:32, 2.73s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 973 [2024-08-28 13:33:49,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.81 | bwd_microstep: 892.74 | bwd_inner_microstep: 892.54 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:33:51,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.62 | optimizer_step: 0.66 [2024-08-28 13:33:51,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.46 | bwd_microstep: 863.07 | bwd_inner_microstep: 856.44 | bwd_allreduce_microstep: 6.48 | step_microstep: 10.15 [2024-08-28 13:33:51,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 957.25 | bwd: 1755.84 | bwd_inner: 1749.03 | bwd_allreduce: 6.55 | step: 10.27 58%|█████▊ | 75/130 [05:05<02:30, 2.74s/it] {'loss': 0.6333, 'learning_rate': 1.6037077136012054e-05, 'epoch': 2.88} 58%|█████▊ | 75/130 [05:05<02:30, 2.74s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 925 [2024-08-28 13:33:52,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 462.35 | bwd_microstep: 846.55 | bwd_inner_microstep: 846.32 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 946 [2024-08-28 13:33:53,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:33:53,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.02 | bwd_microstep: 881.62 | bwd_inner_microstep: 875.23 | bwd_allreduce_microstep: 6.31 | step_microstep: 9.93 [2024-08-28 13:33:53,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 941.35 | bwd: 1728.20 | bwd_inner: 1721.60 | bwd_allreduce: 6.38 | step: 10.08 58%|█████▊ | 76/130 [05:07<02:27, 2.73s/it] {'loss': 0.6213, 'learning_rate': 1.5549581320873715e-05, 'epoch': 2.92} 58%|█████▊ | 76/130 [05:07<02:27, 2.73s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 956 [2024-08-28 13:33:55,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.34 | bwd_microstep: 862.10 | bwd_inner_microstep: 861.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 921 [2024-08-28 13:33:56,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.60 | optimizer_step: 0.60 [2024-08-28 13:33:56,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.82 | bwd_microstep: 854.15 | bwd_inner_microstep: 847.60 | bwd_allreduce_microstep: 6.43 | step_microstep: 9.72 [2024-08-28 13:33:56,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 931.14 | bwd: 1716.28 | bwd_inner: 1709.58 | bwd_allreduce: 6.50 | step: 9.85 59%|█████▉ | 77/130 [05:10<02:24, 2.72s/it] {'loss': 0.6124, 'learning_rate': 1.5064852046194127e-05, 'epoch': 2.96} 59%|█████▉ | 77/130 [05:10<02:24, 2.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 968  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.  
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-08-28 13:33:58,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.69 | bwd_microstep: 892.61 | bwd_inner_microstep: 892.45 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12
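This `[WARNING]` block keeps reappearing because every freshly spawned worker process re-imports DeepSpeed and re-probes its optional op builders; none of these warnings affect the run. If the CUTLASS one should go away, the message asks for `$CUTLASS_PATH` pointing at a CUTLASS checkout. One hedged way to provide it, assuming the variable only needs to be visible before `deepspeed` is imported and using a hypothetical path:

```python
import os

# Hypothetical location: point this at an actual CUTLASS checkout on your machine.
os.environ.setdefault("CUTLASS_PATH", "/path/to/cutlass")

import deepspeed  # the op builders read CUTLASS_PATH when probing optional extensions
```

Exporting the variable in the shell that launches training accomplishes the same thing.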
dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 946
[2024-08-28 13:34:00,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.62 | optimizer_step: 0.68
[2024-08-28 13:34:00,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.77 | bwd_microstep: 870.78 | bwd_inner_microstep: 859.64 | bwd_allreduce_microstep: 11.04 | step_microstep: 13.42
[2024-08-28 13:34:00,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 956.43 | bwd: 1763.43 | bwd_inner: 1752.12 | bwd_allreduce: 11.11 | step: 13.55
 60%|██████    | 78/130 [05:14<02:40, 3.09s/it] {'loss': 0.6404, 'learning_rate': 1.4583190637139901e-05, 'epoch': 3.0}
[INFO|trainer.py:3242] 2024-08-28 13:34:00,657 >> ***** Running Evaluation *****
[INFO|trainer.py:3244] 2024-08-28 13:34:00,657 >> Num examples = 46
[INFO|trainer.py:3247] 2024-08-28 13:34:00,657 >> Batch size = 8
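The `learning_rate` values in the step logs follow a cosine decay. A minimal reconstruction that reproduces them, assuming a peak LR of 4e-5 and 4 warmup steps over the 130 total steps (both assumptions; only the logged values themselves come from this run):

```python
import math

base_lr, warmup_steps, total_steps = 4e-5, 4, 130  # assumed, not printed in the log

def lr_at(step: int) -> float:
    """Standard linear-warmup + cosine-decay schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(50))   # ~2.8226e-05 -> logged 2.8225742062612236e-05 at step 50
print(lr_at(67))   # 2.0e-05     -> logged 2e-05 at step 67
print(lr_at(78))   # ~1.4583e-05 -> logged 1.4583190637139901e-05 at step 78
```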
dynamic ViT batch size: 18, images per sample: 2.25, dynamic token length: 906
  0%|          | 0/3 [00:00<?, ?it/s]
petrel_client is not installed. Using PIL to load images.  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.  [WARNING]  async_io: please install the libaio-devel package with yum  [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.  [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images.
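The block above reports the environment facts behind these optional-feature warnings (torch 2.4, triton 3.0.0, missing libaio and petrel_client, unset $CUTLASS_PATH). A small probe, assumed to run inside the same conda env and purely illustrative, surfaces the same information:

import importlib.util
import os
import torch

print('torch:', torch.__version__)            # log reports 2.4; sparse_attn wants >= 1.5 and < 2.0
try:
    import triton
    print('triton:', triton.__version__)      # log reports untested 3.0.0
except ImportError:
    print('triton: not installed')
print('petrel_client installed:', importlib.util.find_spec('petrel_client') is not None)
print('CUTLASS_PATH set:', 'CUTLASS_PATH' in os.environ)

None of these warnings stop training: petrel_client is only needed when reading data from ceph, and image loading falls back to PIL.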
dynamic ViT batch size: 22, images per sample: 2.75, dynamic token length: 936 100%|██████████| 3/3 [00:03<00:00, 1.23s/it] {'eval_loss': 0.5793539881706238, 'eval_runtime': 21.5979, 'eval_samples_per_second': 2.13, 'eval_steps_per_second': 0.139, 'epoch': 3.0} 60%|██████ | 78/130 [05:36<02:40, 3.09s/it] 100%|██████████| 3/3 [00:03<00:00, 1.23s/it]
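The throughput fields in the eval dict follow from the counts logged above (46 examples, 3 eval steps, ~21.6 s runtime); a quick check, values copied from the log:

num_examples, num_steps, runtime_s = 46, 3, 21.5979
print(round(num_examples / runtime_s, 2))   # 2.13  -> eval_samples_per_second
print(round(num_steps / runtime_s, 3))      # 0.139 -> eval_steps_per_second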
dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 956 [2024-08-28 13:34:40,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.93 | bwd_microstep: 850.10 | bwd_inner_microstep: 849.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 925 [2024-08-28 13:34:42,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.62 | optimizer_step: 0.58 [2024-08-28 13:34:42,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.00 | bwd_microstep: 1681.61 | bwd_inner_microstep: 847.98 | bwd_allreduce_microstep: 833.50 | step_microstep: 13.35 [2024-08-28 13:34:42,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 940.90 | bwd: 2531.73 | bwd_inner: 1697.98 | bwd_allreduce: 833.56 | step: 13.48 61%|██████ | 79/130 [05:56<12:33, 14.77s/it] {'loss': 0.6378, 'learning_rate': 1.4104896511781916e-05, 'epoch': 3.04} 61%|██████ | 79/130 [05:56<12:33, 14.77s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 931 [2024-08-28 13:34:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.29 | bwd_microstep: 840.10 | bwd_inner_microstep: 839.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 944 [2024-08-28 13:34:45,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.61 | optimizer_step: 0.56 [2024-08-28 13:34:45,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.56 | bwd_microstep: 921.07 | bwd_inner_microstep: 856.54 | bwd_allreduce_microstep: 64.41 | step_microstep: 12.98 [2024-08-28 13:34:45,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 911.83 | bwd: 1761.20 | bwd_inner: 1696.50 | bwd_allreduce: 64.48 | step: 13.13 62%|██████▏ | 80/130 [05:59<09:17, 11.16s/it] {'loss': 0.629, 'learning_rate': 1.3630266994966314e-05, 'epoch': 3.08}
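Each logged optimizer step appears to consist of two gradient-accumulation microsteps, and the aggregate fwd/bwd times are simply the sums of the two microstep times; taking step 79 above as an example (numbers copied from the log, interpretation hedged):

fwd_microsteps = [481.93, 459.00]   # the two fwd_microstep values logged for step 79
bwd_microsteps = [850.10, 1681.61]  # the two bwd_microstep values for the same step
print(sum(fwd_microsteps))          # ~940.93 ms, matching the logged "fwd: 940.90"
print(sum(bwd_microsteps))          # ~2531.71 ms, matching the logged "bwd: 2531.73"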
62%|██████▏ | 80/130 [05:59<09:17, 11.16s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 920 [2024-08-28 13:34:46,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.28 | bwd_microstep: 833.50 | bwd_inner_microstep: 833.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.09 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 937 [2024-08-28 13:34:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:34:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.78 | bwd_microstep: 874.14 | bwd_inner_microstep: 867.83 | bwd_allreduce_microstep: 6.26 | step_microstep: 9.88 [2024-08-28 13:34:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 930.04 | bwd: 1707.67 | bwd_inner: 1701.16 | bwd_allreduce: 6.35 | step: 9.96 62%|██████▏ | 81/130 [06:01<07:02, 8.62s/it] {'loss': 0.5703, 'learning_rate': 1.3159597133486628e-05, 'epoch': 3.12} 62%|██████▏ | 81/130 [06:01<07:02, 8.62s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 940 [2024-08-28 13:34:49,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.48 | bwd_microstep: 843.36 | bwd_inner_microstep: 843.23 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 926 [2024-08-28 13:34:50,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.62 | optimizer_step: 0.67 [2024-08-28 13:34:50,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.91 | bwd_microstep: 846.09 | bwd_inner_microstep: 839.40 | bwd_allreduce_microstep: 6.58 | step_microstep: 10.33 [2024-08-28 13:34:50,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 897.36 | bwd: 1689.47 | bwd_inner: 1682.68 | bwd_allreduce: 6.64 | step: 10.44 63%|██████▎ | 82/130 [06:04<05:27, 6.82s/it] {'loss': 0.5396, 'learning_rate': 1.26931795126721e-05, 'epoch': 3.15} 63%|██████▎ | 82/130 [06:04<05:27, 6.82s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 917 [2024-08-28 13:34:52,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.42 | bwd_microstep: 832.57 | bwd_inner_microstep: 832.35 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 908 [2024-08-28 13:34:53,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.62 | optimizer_step: 0.59 [2024-08-28 13:34:53,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 469.28 | bwd_microstep: 863.36 | bwd_inner_microstep: 852.88 | bwd_allreduce_microstep: 10.36 | step_microstep: 13.22 [2024-08-28 13:34:53,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 914.68 | bwd: 1695.96 | bwd_inner: 1685.30 | bwd_allreduce: 10.44 | step: 13.34 64%|██████▍ | 83/130 [06:07<04:22, 5.58s/it] {'loss': 0.5786, 'learning_rate': 1.2231304074506108e-05, 'epoch': 3.19} 64%|██████▍ | 83/130 [06:07<04:22, 5.58s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 937 [2024-08-28 13:34:54,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.31 | bwd_microstep: 866.95 | bwd_inner_microstep: 866.72 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 10, images per sample: 2.5, 
dynamic token length: 924 [2024-08-28 13:34:56,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:34:56,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.50 | bwd_microstep: 853.38 | bwd_inner_microstep: 846.74 | bwd_allreduce_microstep: 6.51 | step_microstep: 9.93 [2024-08-28 13:34:56,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 936.79 | bwd: 1720.38 | bwd_inner: 1713.55 | bwd_allreduce: 6.59 | step: 10.08 65%|██████▍ | 84/130 [06:09<03:36, 4.72s/it] {'loss': 0.638, 'learning_rate': 1.1774257937387774e-05, 'epoch': 3.23} 65%|██████▍ | 84/130 [06:09<03:36, 4.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 953 [2024-08-28 13:34:57,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.67 | bwd_microstep: 874.42 | bwd_inner_microstep: 874.16 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.18 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:34:58,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.62 | optimizer_step: 0.61 [2024-08-28 13:34:58,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.99 | bwd_microstep: 862.35 | bwd_inner_microstep: 851.59 | bwd_allreduce_microstep: 10.64 | step_microstep: 13.11 [2024-08-28 13:34:58,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 943.64 | bwd: 1736.81 | bwd_inner: 1725.84 | bwd_allreduce: 10.73 | step: 13.29 65%|██████▌ | 85/130 [06:12<03:05, 4.12s/it] {'loss': 0.6425, 'learning_rate': 1.132232521764884e-05, 'epoch': 3.27} 65%|██████▌ | 85/130 [06:12<03:05, 4.12s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:35:00,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.59 | bwd_microstep: 868.89 | bwd_inner_microstep: 868.65 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.21 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 915 [2024-08-28 13:35:01,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:35:01,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.97 | bwd_microstep: 849.85 | bwd_inner_microstep: 843.52 | bwd_allreduce_microstep: 6.23 | step_microstep: 9.87 [2024-08-28 13:35:01,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 937.54 | bwd: 1718.77 | bwd_inner: 1712.26 | bwd_allreduce: 6.32 | step: 10.05 66%|██████▌ | 86/130 [06:15<02:42, 3.70s/it] {'loss': 0.5446, 'learning_rate': 1.087578685293674e-05, 'epoch': 3.31} 66%|██████▌ | 86/130 [06:15<02:42, 3.70s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 931 [2024-08-28 13:35:02,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.67 | bwd_microstep: 850.96 | bwd_inner_microstep: 850.72 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 921 [2024-08-28 13:35:04,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.62 | optimizer_step: 0.59 [2024-08-28 13:35:04,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.33 | bwd_microstep: 858.87 | bwd_inner_microstep: 845.77 | bwd_allreduce_microstep: 12.97 | 
step_microstep: 13.08 [2024-08-28 13:35:04,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 924.97 | bwd: 1709.87 | bwd_inner: 1696.59 | bwd_allreduce: 13.05 | step: 13.20 67%|██████▋ | 87/130 [06:18<02:26, 3.40s/it] {'loss': 0.5612, 'learning_rate': 1.0434920427573643e-05, 'epoch': 3.35} 67%|██████▋ | 87/130 [06:18<02:26, 3.40s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 978 [2024-08-28 13:35:05,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.56 | bwd_microstep: 880.12 | bwd_inner_microstep: 879.89 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 945 [2024-08-28 13:35:07,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.59 | optimizer_step: 0.58 [2024-08-28 13:35:07,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.08 | bwd_microstep: 879.44 | bwd_inner_microstep: 873.21 | bwd_allreduce_microstep: 6.11 | step_microstep: 9.67 [2024-08-28 13:35:07,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 957.61 | bwd: 1759.60 | bwd_inner: 1753.19 | bwd_allreduce: 6.19 | step: 9.82 68%|██████▊ | 88/130 [06:20<02:14, 3.21s/it] {'loss': 0.5518, 'learning_rate': 1.0000000000000006e-05, 'epoch': 3.38} 68%|██████▊ | 88/130 [06:20<02:14, 3.21s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:35:08,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.72 | bwd_microstep: 869.94 | bwd_inner_microstep: 869.71 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.15 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 931 [2024-08-28 13:35:09,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.57 [2024-08-28 13:35:09,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.10 | bwd_microstep: 858.37 | bwd_inner_microstep: 851.88 | bwd_allreduce_microstep: 6.37 | step_microstep: 9.84 [2024-08-28 13:35:09,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 942.79 | bwd: 1728.35 | bwd_inner: 1721.67 | bwd_allreduce: 6.44 | step: 9.99 68%|██████▊ | 89/130 [06:23<02:05, 3.06s/it] {'loss': 0.5649, 'learning_rate': 9.57129593241004e-06, 'epoch': 3.42} 68%|██████▊ | 89/130 [06:23<02:05, 3.06s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 966 [2024-08-28 13:35:11,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.00 | bwd_microstep: 862.66 | bwd_inner_microstep: 862.44 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 923 [2024-08-28 13:35:12,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.60 | optimizer_step: 0.60 [2024-08-28 13:35:12,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.86 | bwd_microstep: 869.40 | bwd_inner_microstep: 862.61 | bwd_allreduce_microstep: 6.66 | step_microstep: 9.95 [2024-08-28 13:35:12,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 936.83 | bwd: 1732.09 | bwd_inner: 1725.15 | bwd_allreduce: 6.73 | step: 10.09 69%|██████▉ | 90/130 [06:26<01:58, 2.96s/it] {'loss': 0.6007, 'learning_rate': 9.149074722684815e-06, 'epoch': 3.46} 69%|██████▉ | 90/130 [06:26<01:58, 2.96s/it]dynamic ViT batch size: 12, images per 
sample: 3.0, dynamic token length: 968 [2024-08-28 13:35:13,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.65 | bwd_microstep: 890.80 | bwd_inner_microstep: 890.54 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.16 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 985 [2024-08-28 13:35:15,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.62 | optimizer_step: 0.60 [2024-08-28 13:35:15,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.32 | bwd_microstep: 905.76 | bwd_inner_microstep: 899.06 | bwd_allreduce_microstep: 6.58 | step_microstep: 10.08 [2024-08-28 13:35:15,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 983.95 | bwd: 1796.60 | bwd_inner: 1789.69 | bwd_allreduce: 6.68 | step: 10.24 70%|███████ | 91/130 [06:29<01:53, 2.92s/it] {'loss': 0.4963, 'learning_rate': 8.733598838727559e-06, 'epoch': 3.5} 70%|███████ | 91/130 [06:29<01:53, 2.92s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 963 [2024-08-28 13:35:16,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.64 | bwd_microstep: 889.89 | bwd_inner_microstep: 889.67 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 936 [2024-08-28 13:35:18,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.59 | optimizer_step: 0.56 [2024-08-28 13:35:18,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.47 | bwd_microstep: 849.96 | bwd_inner_microstep: 843.84 | bwd_allreduce_microstep: 6.04 | step_microstep: 9.63 [2024-08-28 13:35:18,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 943.10 | bwd: 1739.87 | bwd_inner: 1733.56 | bwd_allreduce: 6.13 | step: 9.74 71%|███████ | 92/130 [06:31<01:48, 2.86s/it] {'loss': 0.5296, 'learning_rate': 8.325126555304208e-06, 'epoch': 3.54} 71%|███████ | 92/130 [06:31<01:48, 2.86s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 934 [2024-08-28 13:35:19,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.64 | bwd_microstep: 868.47 | bwd_inner_microstep: 868.30 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 928 [2024-08-28 13:35:20,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:35:20,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.26 | bwd_microstep: 870.66 | bwd_inner_microstep: 863.94 | bwd_allreduce_microstep: 6.64 | step_microstep: 9.69 [2024-08-28 13:35:20,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 952.88 | bwd: 1739.16 | bwd_inner: 1732.24 | bwd_allreduce: 6.72 | step: 9.84 72%|███████▏ | 93/130 [06:34<01:44, 2.83s/it] {'loss': 0.5389, 'learning_rate': 7.923911793490449e-06, 'epoch': 3.58} 72%|███████▏ | 93/130 [06:34<01:44, 2.83s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 926 [2024-08-28 13:35:22,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 462.35 | bwd_microstep: 848.34 | bwd_inner_microstep: 848.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 979 [2024-08-28 13:35:23,478] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.57 [2024-08-28 13:35:23,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.60 | bwd_microstep: 889.47 | bwd_inner_microstep: 883.21 | bwd_allreduce_microstep: 6.16 | step_microstep: 9.70 [2024-08-28 13:35:23,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 942.93 | bwd: 1737.83 | bwd_inner: 1731.53 | bwd_allreduce: 6.17 | step: 9.77 72%|███████▏ | 94/130 [06:37<01:40, 2.80s/it] {'loss': 0.4734, 'learning_rate': 7.530203962825331e-06, 'epoch': 3.62} 72%|███████▏ | 94/130 [06:37<01:40, 2.80s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 966 [2024-08-28 13:35:24,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.05 | bwd_microstep: 889.80 | bwd_inner_microstep: 889.77 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 932 [2024-08-28 13:35:26,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:35:26,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.06 | bwd_microstep: 863.59 | bwd_inner_microstep: 857.24 | bwd_allreduce_microstep: 6.22 | step_microstep: 9.64 [2024-08-28 13:35:26,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 955.08 | bwd: 1753.40 | bwd_inner: 1747.05 | bwd_allreduce: 6.23 | step: 9.72 73%|███████▎ | 95/130 [06:40<01:37, 2.79s/it] {'loss': 0.5749, 'learning_rate': 7.1442478062692135e-06, 'epoch': 3.65} 73%|███████▎ | 95/130 [06:40<01:37, 2.79s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 930 [2024-08-28 13:35:27,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.28 | bwd_microstep: 869.65 | bwd_inner_microstep: 869.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 972 [2024-08-28 13:35:29,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:35:29,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.94 | bwd_microstep: 888.33 | bwd_inner_microstep: 882.08 | bwd_allreduce_microstep: 6.13 | step_microstep: 9.69 [2024-08-28 13:35:29,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 958.20 | bwd: 1758.01 | bwd_inner: 1751.60 | bwd_allreduce: 6.17 | step: 9.83 74%|███████▍ | 96/130 [06:42<01:34, 2.78s/it] {'loss': 0.558, 'learning_rate': 6.766283248062817e-06, 'epoch': 3.69} 74%|███████▍ | 96/130 [06:42<01:34, 2.78s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:35:30,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.28 | bwd_microstep: 869.44 | bwd_inner_microstep: 869.27 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:35:31,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.58 | optimizer_step: 0.56 [2024-08-28 13:35:31,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.51 | bwd_microstep: 863.11 | bwd_inner_microstep: 857.05 | bwd_allreduce_microstep: 6.01 | step_microstep: 9.64 [2024-08-28 13:35:31,727] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 944.78 | bwd: 1732.58 | bwd_inner: 1726.32 | bwd_allreduce: 6.10 | step: 9.73 75%|███████▍ | 97/130 [06:45<01:31, 2.76s/it] {'loss': 0.5729, 'learning_rate': 6.396545244581609e-06, 'epoch': 3.73} 75%|███████▍ | 97/130 [06:45<01:31, 2.76s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 960 [2024-08-28 13:35:33,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 470.67 | bwd_microstep: 865.67 | bwd_inner_microstep: 865.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 980 [2024-08-28 13:35:34,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:35:34,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.54 | bwd_microstep: 892.08 | bwd_inner_microstep: 885.54 | bwd_allreduce_microstep: 6.42 | step_microstep: 9.62 [2024-08-28 13:35:34,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 952.19 | bwd: 1757.78 | bwd_inner: 1751.07 | bwd_allreduce: 6.49 | step: 9.74 75%|███████▌ | 98/130 [06:48<01:28, 2.76s/it] {'loss': 0.5351, 'learning_rate': 6.035263638278546e-06, 'epoch': 3.77} 75%|███████▌ | 98/130 [06:48<01:28, 2.76s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 975 [2024-08-28 13:35:35,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.44 | bwd_microstep: 896.69 | bwd_inner_microstep: 896.66 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.06 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 946 [2024-08-28 13:35:37,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.59 | optimizer_step: 0.56 [2024-08-28 13:35:37,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.57 | bwd_microstep: 855.55 | bwd_inner_microstep: 849.09 | bwd_allreduce_microstep: 6.35 | step_microstep: 9.55 [2024-08-28 13:35:37,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 947.98 | bwd: 1752.24 | bwd_inner: 1745.78 | bwd_allreduce: 6.34 | step: 9.62 76%|███████▌ | 99/130 [06:51<01:25, 2.76s/it] {'loss': 0.54, 'learning_rate': 5.682663014805631e-06, 'epoch': 3.81} 76%|███████▌ | 99/130 [06:51<01:25, 2.76s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 934 [2024-08-28 13:35:38,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.77 | bwd_microstep: 854.99 | bwd_inner_microstep: 854.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 904 [2024-08-28 13:35:39,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.61 | optimizer_step: 0.64 [2024-08-28 13:35:39,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.79 | bwd_microstep: 840.34 | bwd_inner_microstep: 833.80 | bwd_allreduce_microstep: 6.42 | step_microstep: 10.11 [2024-08-28 13:35:39,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 912.53 | bwd: 1695.35 | bwd_inner: 1688.72 | bwd_allreduce: 6.46 | step: 10.21 77%|███████▋ | 100/130 [06:53<01:21, 2.73s/it] {'loss': 0.454, 'learning_rate': 5.338962563403478e-06, 'epoch': 3.85} 77%|███████▋ | 100/130 [06:53<01:21, 2.73s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 951 [2024-08-28 13:35:41,241] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 468.54 | bwd_microstep: 861.75 | bwd_inner_microstep: 861.53 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:35:42,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:35:42,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.09 | bwd_microstep: 861.50 | bwd_inner_microstep: 855.11 | bwd_allreduce_microstep: 6.28 | step_microstep: 9.77 [2024-08-28 13:35:42,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 934.61 | bwd: 1723.30 | bwd_inner: 1716.71 | bwd_allreduce: 6.35 | step: 9.87 78%|███████▊ | 101/130 [06:56<01:18, 2.72s/it] {'loss': 0.5863, 'learning_rate': 5.004375940645314e-06, 'epoch': 3.88} 78%|███████▊ | 101/130 [06:56<01:18, 2.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 973 [2024-08-28 13:35:44,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.32 | bwd_microstep: 893.69 | bwd_inner_microstep: 893.49 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 918 [2024-08-28 13:35:45,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:35:45,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.85 | bwd_microstep: 841.87 | bwd_inner_microstep: 835.54 | bwd_allreduce_microstep: 6.21 | step_microstep: 10.57 [2024-08-28 13:35:45,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 941.15 | bwd: 1735.59 | bwd_inner: 1729.10 | bwd_allreduce: 6.28 | step: 10.71 78%|███████▊ | 102/130 [06:59<01:16, 2.72s/it] {'loss': 0.5254, 'learning_rate': 4.679111137620442e-06, 'epoch': 3.92} 78%|███████▊ | 102/130 [06:59<01:16, 2.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 959 [2024-08-28 13:35:46,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.78 | bwd_microstep: 879.40 | bwd_inner_microstep: 879.22 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 927 [2024-08-28 13:35:48,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.59 | optimizer_step: 0.56 [2024-08-28 13:35:48,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.53 | bwd_microstep: 857.49 | bwd_inner_microstep: 851.13 | bwd_allreduce_microstep: 6.25 | step_microstep: 9.68 [2024-08-28 13:35:48,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 946.29 | bwd: 1736.92 | bwd_inner: 1730.39 | bwd_allreduce: 6.32 | step: 9.81 79%|███████▉ | 103/130 [07:01<01:13, 2.73s/it] {'loss': 0.6083, 'learning_rate': 4.363370350639405e-06, 'epoch': 3.96} 79%|███████▉ | 103/130 [07:01<01:13, 2.73s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 946
[2024-08-28 13:35:49,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.03 | bwd_microstep: 846.18 | bwd_inner_microstep: 845.97 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 953 [2024-08-28 13:35:51,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.58 | optimizer_step: 0.56 [2024-08-28 13:35:51,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.75 | bwd_microstep: 883.72 | bwd_inner_microstep: 877.35 | bwd_allreduce_microstep: 6.30 | step_microstep: 9.67 [2024-08-28 13:35:51,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 932.76 | bwd: 1729.92 | bwd_inner: 1723.35 | bwd_allreduce: 6.37 | step: 9.82 80%|████████ | 104/130 [07:05<01:18, 3.00s/it] {'loss': 0.4868, 'learning_rate': 4.057349855541557e-06, 'epoch': 4.0} 80%|████████ | 104/130 [07:05<01:18, 3.00s/it][INFO|trainer.py:3242] 2024-08-28 13:35:51,719 >> ***** Running Evaluation ***** [INFO|trainer.py:3244] 2024-08-28 13:35:51,719 >> Num examples = 46 [INFO|trainer.py:3247] 2024-08-28 13:35:51,719 >> Batch size = 8
dynamic ViT batch size: 18, images per sample: 2.25, dynamic token length: 906  0%|          | 0/3 [00:00<?, ?it/s]  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
dynamic ViT batch size: 22, images per sample: 2.75, dynamic token length: 936 100%|██████████| 3/3 [00:02<00:00, 1.12s/it] {'eval_loss': 0.5455546379089355, 'eval_runtime': 21.1145, 'eval_samples_per_second': 2.179, 'eval_steps_per_second': 0.142, 'epoch': 4.0} 80%|████████ | 104/130 [07:26<01:18, 3.00s/it] 100%|██████████| 3/3 [00:03<00:00, 1.12s/it]
def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): [2024-08-28 13:36:18,517] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-08-28 13:36:18,549] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): [2024-08-28 13:36:22,583] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-08-28 13:36:22,633] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. 
def backward(ctx, grad_output): [2024-08-28 13:36:26,574] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-08-28 13:36:26,744] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. def forward(ctx, input, weight, bias=None): /mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. def backward(ctx, grad_output): dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 985 [2024-08-28 13:36:31,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 531.27 | bwd_microstep: 890.76 | bwd_inner_microstep: 890.72 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.08 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 933 [2024-08-28 13:36:32,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.59 | optimizer_step: 0.57 [2024-08-28 13:36:32,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.60 | bwd_microstep: 858.34 | bwd_inner_microstep: 852.23 | bwd_allreduce_microstep: 6.06 | step_microstep: 9.66 [2024-08-28 13:36:32,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 991.85 | bwd: 1749.10 | bwd_inner: 1742.96 | bwd_allreduce: 6.08 | step: 9.74 81%|████████ | 105/130 [07:46<05:59, 14.39s/it] {'loss': 0.449, 'learning_rate': 3.76123988568287e-06, 'epoch': 4.04} 81%|████████ | 105/130 [07:46<05:59, 14.39s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 938 [2024-08-28 13:36:34,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 474.61 | bwd_microstep: 866.88 | bwd_inner_microstep: 866.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 931 [2024-08-28 13:36:35,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.59 | optimizer_step: 0.54 [2024-08-28 13:36:35,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.87 | bwd_microstep: 871.34 | bwd_inner_microstep: 865.38 | bwd_allreduce_microstep: 5.91 | step_microstep: 9.56 [2024-08-28 13:36:35,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 948.45 | bwd: 1738.25 | bwd_inner: 1732.11 | bwd_allreduce: 6.00 | step: 9.68 82%|████████▏ | 106/130 [07:49<04:21, 10.89s/it] {'loss': 0.5543, 'learning_rate': 3.4752245136801065e-06, 'epoch': 4.08} 
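As a rough sanity check on the epoch-4.0 evaluation record above, the reported throughput fields are internally consistent: the 46 eval examples (the count the Trainer prints in its evaluation banner later in this log) and the 3 eval steps, divided by the ~21.1 s eval runtime, reproduce `eval_samples_per_second` and `eval_steps_per_second`. A minimal sketch, using only numbers taken from this log:

```python
# Back-of-the-envelope check of the eval throughput fields above.
# Assumes 46 eval examples, taken from the "Num examples = 46" banner later in this log.
eval_runtime = 21.1145      # seconds, from the epoch-4.0 eval record
num_eval_examples = 46
num_eval_steps = 3          # the "3/3" eval progress bar

print(num_eval_examples / eval_runtime)  # ~2.18  -> eval_samples_per_second = 2.179
print(num_eval_steps / eval_runtime)     # ~0.142 -> eval_steps_per_second  = 0.142
```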
82%|████████▏ | 106/130 [07:49<04:21, 10.89s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 925 [2024-08-28 13:36:36,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.19 | bwd_microstep: 835.25 | bwd_inner_microstep: 835.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 963 [2024-08-28 13:36:38,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.58 | optimizer_step: 0.55 [2024-08-28 13:36:38,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.91 | bwd_microstep: 880.64 | bwd_inner_microstep: 874.10 | bwd_allreduce_microstep: 6.43 | step_microstep: 9.59 [2024-08-28 13:36:38,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 920.08 | bwd: 1715.92 | bwd_inner: 1709.19 | bwd_allreduce: 6.50 | step: 9.72 82%|████████▏ | 107/130 [07:51<03:13, 8.43s/it] {'loss': 0.492, 'learning_rate': 3.199481536984572e-06, 'epoch': 4.12} 82%|████████▏ | 107/130 [07:51<03:13, 8.43s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:36:39,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.95 | bwd_microstep: 849.34 | bwd_inner_microstep: 849.29 | bwd_allreduce_microstep: 0.01 | step_microstep: 0.07 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 925 [2024-08-28 13:36:40,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:36:40,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.70 | bwd_microstep: 887.77 | bwd_inner_microstep: 846.91 | bwd_allreduce_microstep: 40.81 | step_microstep: 12.99 [2024-08-28 13:36:40,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 922.63 | bwd: 1737.15 | bwd_inner: 1696.19 | bwd_allreduce: 40.84 | step: 13.07 83%|████████▎ | 108/130 [07:54<02:27, 6.71s/it] {'loss': 0.561, 'learning_rate': 2.934182367356888e-06, 'epoch': 4.15} 83%|████████▎ | 108/130 [07:54<02:27, 6.71s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 960 [2024-08-28 13:36:42,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.75 | bwd_microstep: 875.45 | bwd_inner_microstep: 875.30 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 942 [2024-08-28 13:36:43,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:36:43,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.05 | bwd_microstep: 887.26 | bwd_inner_microstep: 871.15 | bwd_allreduce_microstep: 16.04 | step_microstep: 13.13 [2024-08-28 13:36:43,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 954.78 | bwd: 1762.74 | bwd_inner: 1746.45 | bwd_allreduce: 16.11 | step: 13.26 84%|████████▍ | 109/130 [07:57<01:56, 5.53s/it] {'loss': 0.5317, 'learning_rate': 2.679491924311226e-06, 'epoch': 4.19} 84%|████████▍ | 109/130 [07:57<01:56, 5.53s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 939 [2024-08-28 13:36:44,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.61 | bwd_microstep: 852.70 | bwd_inner_microstep: 852.47 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, 
images per sample: 2.5, dynamic token length: 937 [2024-08-28 13:36:46,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.60 | optimizer_step: 0.56 [2024-08-28 13:36:46,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.85 | bwd_microstep: 896.29 | bwd_inner_microstep: 854.20 | bwd_allreduce_microstep: 41.98 | step_microstep: 13.20 [2024-08-28 13:36:46,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 929.45 | bwd: 1749.02 | bwd_inner: 1706.74 | bwd_allreduce: 42.04 | step: 13.33 85%|████████▍ | 110/130 [08:00<01:33, 4.69s/it] {'loss': 0.5508, 'learning_rate': 2.435568532595427e-06, 'epoch': 4.23} 85%|████████▍ | 110/130 [08:00<01:33, 4.69s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 924 [2024-08-28 13:36:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.87 | bwd_microstep: 835.73 | bwd_inner_microstep: 835.49 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 917 [2024-08-28 13:36:49,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.59 | optimizer_step: 0.56 [2024-08-28 13:36:49,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.08 | bwd_microstep: 901.75 | bwd_inner_microstep: 859.06 | bwd_allreduce_microstep: 42.62 | step_microstep: 13.04 [2024-08-28 13:36:49,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 916.93 | bwd: 1737.52 | bwd_inner: 1694.59 | bwd_allreduce: 42.69 | step: 13.16 85%|████████▌ | 111/130 [08:02<01:17, 4.10s/it] {'loss': 0.4914, 'learning_rate': 2.2025638237706294e-06, 'epoch': 4.27} 85%|████████▌ | 111/130 [08:02<01:17, 4.10s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:36:50,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.85 | bwd_microstep: 868.77 | bwd_inner_microstep: 868.59 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 923 [2024-08-28 13:36:51,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.59 | optimizer_step: 0.55 [2024-08-28 13:36:51,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.75 | bwd_microstep: 851.60 | bwd_inner_microstep: 845.50 | bwd_allreduce_microstep: 6.05 | step_microstep: 9.61 [2024-08-28 13:36:51,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 938.58 | bwd: 1720.40 | bwd_inner: 1714.10 | bwd_allreduce: 6.16 | step: 9.74 86%|████████▌ | 112/130 [08:05<01:06, 3.68s/it] {'loss': 0.5875, 'learning_rate': 1.9806226419516195e-06, 'epoch': 4.31} 86%|████████▌ | 112/130 [08:05<01:06, 3.68s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 943 [2024-08-28 13:36:53,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.80 | bwd_microstep: 856.34 | bwd_inner_microstep: 856.11 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 975 [2024-08-28 13:36:54,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.59 | optimizer_step: 0.57 [2024-08-28 13:36:54,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.71 | bwd_microstep: 875.14 | bwd_inner_microstep: 868.56 | 
bwd_allreduce_microstep: 6.45 | step_microstep: 9.75 [2024-08-28 13:36:54,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 931.49 | bwd: 1731.51 | bwd_inner: 1724.75 | bwd_allreduce: 6.54 | step: 9.87 87%|████████▋ | 113/130 [08:08<00:57, 3.39s/it] {'loss': 0.4155, 'learning_rate': 1.7698829537665374e-06, 'epoch': 4.35} 87%|████████▋ | 113/130 [08:08<00:57, 3.39s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 946 [2024-08-28 13:36:55,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.09 | bwd_microstep: 872.71 | bwd_inner_microstep: 872.40 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 960 [2024-08-28 13:36:57,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.58 [2024-08-28 13:36:57,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.71 | bwd_microstep: 885.84 | bwd_inner_microstep: 879.30 | bwd_allreduce_microstep: 6.49 | step_microstep: 9.75 [2024-08-28 13:36:57,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 960.78 | bwd: 1758.58 | bwd_inner: 1751.79 | bwd_allreduce: 6.60 | step: 9.90 88%|████████▊ | 114/130 [08:11<00:51, 3.20s/it] {'loss': 0.5517, 'learning_rate': 1.5704757625918454e-06, 'epoch': 4.38} 88%|████████▊ | 114/130 [08:11<00:51, 3.20s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 973 [2024-08-28 13:36:58,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.29 | bwd_microstep: 892.99 | bwd_inner_microstep: 892.72 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.16 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 925 [2024-08-28 13:36:59,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:36:59,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.39 | bwd_microstep: 854.78 | bwd_inner_microstep: 848.12 | bwd_allreduce_microstep: 6.53 | step_microstep: 9.83 [2024-08-28 13:36:59,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 952.66 | bwd: 1747.80 | bwd_inner: 1740.94 | bwd_allreduce: 6.65 | step: 9.99 88%|████████▊ | 115/130 [08:13<00:46, 3.07s/it] {'loss': 0.5025, 'learning_rate': 1.3825250271159175e-06, 'epoch': 4.42} 88%|████████▊ | 115/130 [08:13<00:46, 3.07s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 934 [2024-08-28 13:37:01,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.30 | bwd_microstep: 841.40 | bwd_inner_microstep: 841.28 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 917 [2024-08-28 13:37:02,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.60 | optimizer_step: 0.54 [2024-08-28 13:37:02,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.34 | bwd_microstep: 879.03 | bwd_inner_microstep: 834.86 | bwd_allreduce_microstep: 44.05 | step_microstep: 13.00 [2024-08-28 13:37:02,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 897.62 | bwd: 1720.45 | bwd_inner: 1676.19 | bwd_allreduce: 44.09 | step: 13.09 89%|████████▉ | 116/130 [08:16<00:41, 2.95s/it] {'loss': 0.4756, 'learning_rate': 1.2061475842818337e-06, 'epoch': 4.46} 89%|████████▉ | 116/130 
[08:16<00:41, 2.95s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 956 [2024-08-28 13:37:03,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 468.98 | bwd_microstep: 861.56 | bwd_inner_microstep: 861.35 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 911 [2024-08-28 13:37:05,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:37:05,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.67 | bwd_microstep: 862.83 | bwd_inner_microstep: 843.88 | bwd_allreduce_microstep: 18.83 | step_microstep: 13.34 [2024-08-28 13:37:05,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 928.62 | bwd: 1724.43 | bwd_inner: 1705.30 | bwd_allreduce: 18.90 | step: 13.47 90%|█████████ | 117/130 [08:19<00:37, 2.87s/it] {'loss': 0.5201, 'learning_rate': 1.0414530766573661e-06, 'epoch': 4.5} 90%|█████████ | 117/130 [08:19<00:37, 2.87s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 951 [2024-08-28 13:37:06,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.54 | bwd_microstep: 875.33 | bwd_inner_microstep: 875.16 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 925 [2024-08-28 13:37:08,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:37:08,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 462.89 | bwd_microstep: 854.52 | bwd_inner_microstep: 848.01 | bwd_allreduce_microstep: 6.38 | step_microstep: 9.86 [2024-08-28 13:37:08,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 943.41 | bwd: 1729.88 | bwd_inner: 1723.22 | bwd_allreduce: 6.46 | step: 9.98 91%|█████████ | 118/130 [08:21<00:33, 2.83s/it] {'loss': 0.5558, 'learning_rate': 8.885438842771843e-07, 'epoch': 4.54} 91%|█████████ | 118/130 [08:21<00:33, 2.83s/it]dynamic ViT batch size: 6, images per sample: 1.5, dynamic token length: 909 [2024-08-28 13:37:09,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 431.17 | bwd_microstep: 820.25 | bwd_inner_microstep: 820.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 920 [2024-08-28 13:37:10,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.59 | optimizer_step: 0.57 [2024-08-28 13:37:10,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.50 | bwd_microstep: 885.10 | bwd_inner_microstep: 862.13 | bwd_allreduce_microstep: 22.90 | step_microstep: 13.26 [2024-08-28 13:37:10,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 904.65 | bwd: 1705.39 | bwd_inner: 1682.21 | bwd_allreduce: 22.97 | step: 13.38 92%|█████████▏| 119/130 [08:24<00:30, 2.78s/it] {'loss': 0.4251, 'learning_rate': 7.475150609997595e-07, 'epoch': 4.58} 92%|█████████▏| 119/130 [08:24<00:30, 2.78s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 939 [2024-08-28 13:37:12,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.43 | bwd_microstep: 871.54 | bwd_inner_microstep: 871.38 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 10, images per sample: 2.5, 
dynamic token length: 930 [2024-08-28 13:37:13,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.58 | optimizer_step: 0.56 [2024-08-28 13:37:13,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.12 | bwd_microstep: 860.21 | bwd_inner_microstep: 854.12 | bwd_allreduce_microstep: 6.03 | step_microstep: 9.64 [2024-08-28 13:37:13,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 944.53 | bwd: 1731.77 | bwd_inner: 1725.50 | bwd_allreduce: 6.13 | step: 9.78 92%|█████████▏| 120/130 [08:27<00:27, 2.76s/it] {'loss': 0.6341, 'learning_rate': 6.184542754184431e-07, 'epoch': 4.62} 92%|█████████▏| 120/130 [08:27<00:27, 2.76s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 926 [2024-08-28 13:37:14,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.18 | bwd_microstep: 863.67 | bwd_inner_microstep: 863.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 927 [2024-08-28 13:37:16,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.61 | optimizer_step: 0.59 [2024-08-28 13:37:16,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.81 | bwd_microstep: 847.34 | bwd_inner_microstep: 840.62 | bwd_allreduce_microstep: 6.62 | step_microstep: 10.16 [2024-08-28 13:37:16,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 926.97 | bwd: 1711.03 | bwd_inner: 1704.18 | bwd_allreduce: 6.67 | step: 10.30 93%|█████████▎| 121/130 [08:29<00:24, 2.74s/it] {'loss': 0.4536, 'learning_rate': 5.014417563635276e-07, 'epoch': 4.65} 93%|█████████▎| 121/130 [08:29<00:24, 2.74s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 966 [2024-08-28 13:37:17,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.91 | bwd_microstep: 890.49 | bwd_inner_microstep: 890.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 924 [2024-08-28 13:37:18,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.61 [2024-08-28 13:37:18,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 449.99 | bwd_microstep: 844.81 | bwd_inner_microstep: 838.27 | bwd_allreduce_microstep: 6.42 | step_microstep: 10.16 [2024-08-28 13:37:18,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 939.88 | bwd: 1735.33 | bwd_inner: 1728.60 | bwd_allreduce: 6.49 | step: 10.30 94%|█████████▍| 122/130 [08:32<00:21, 2.74s/it] {'loss': 0.5496, 'learning_rate': 3.965502430291235e-07, 'epoch': 4.69} 94%|█████████▍| 122/130 [08:32<00:21, 2.74s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 922 [2024-08-28 13:37:20,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.09 | bwd_microstep: 844.76 | bwd_inner_microstep: 844.55 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 908 [2024-08-28 13:37:21,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.60 [2024-08-28 13:37:21,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.38 | bwd_microstep: 864.40 | bwd_inner_microstep: 857.79 | bwd_allreduce_microstep: 6.52 
| step_microstep: 10.21 [2024-08-28 13:37:21,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 936.44 | bwd: 1709.19 | bwd_inner: 1702.38 | bwd_allreduce: 6.59 | step: 10.36 95%|█████████▍| 123/130 [08:35<00:19, 2.73s/it] {'loss': 0.5243, 'learning_rate': 3.038449397558396e-07, 'epoch': 4.73} 95%|█████████▍| 123/130 [08:35<00:19, 2.73s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 956 [2024-08-28 13:37:22,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.84 | bwd_microstep: 878.56 | bwd_inner_microstep: 878.40 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 930 [2024-08-28 13:37:24,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.58 [2024-08-28 13:37:24,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 466.28 | bwd_microstep: 859.07 | bwd_inner_microstep: 852.40 | bwd_allreduce_microstep: 6.55 | step_microstep: 9.98 [2024-08-28 13:37:24,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 948.10 | bwd: 1737.67 | bwd_inner: 1730.84 | bwd_allreduce: 6.62 | step: 10.12 95%|█████████▌| 124/130 [08:38<00:16, 2.73s/it] {'loss': 0.5546, 'learning_rate': 2.2338347549742956e-07, 'epoch': 4.77} 95%|█████████▌| 124/130 [08:38<00:16, 2.73s/it]dynamic ViT batch size: 8, images per sample: 2.0, dynamic token length: 932 [2024-08-28 13:37:25,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.15 | bwd_microstep: 840.94 | bwd_inner_microstep: 840.72 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 925 [2024-08-28 13:37:26,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.63 | optimizer_step: 0.60 [2024-08-28 13:37:26,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.34 | bwd_microstep: 870.23 | bwd_inner_microstep: 863.76 | bwd_allreduce_microstep: 6.42 | step_microstep: 10.27 [2024-08-28 13:37:26,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 929.46 | bwd: 1711.20 | bwd_inner: 1704.52 | bwd_allreduce: 6.51 | step: 10.42 96%|█████████▌| 125/130 [08:40<00:13, 2.72s/it] {'loss': 0.4766, 'learning_rate': 1.5521586799655875e-07, 'epoch': 4.81} 96%|█████████▌| 125/130 [08:40<00:13, 2.72s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 979 [2024-08-28 13:37:28,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.09 | bwd_microstep: 897.60 | bwd_inner_microstep: 897.36 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 913 [2024-08-28 13:37:29,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.62 | optimizer_step: 0.62 [2024-08-28 13:37:29,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.43 | bwd_microstep: 850.89 | bwd_inner_microstep: 844.15 | bwd_allreduce_microstep: 6.60 | step_microstep: 10.14 [2024-08-28 13:37:29,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 956.49 | bwd: 1748.52 | bwd_inner: 1741.61 | bwd_allreduce: 6.69 | step: 10.32 97%|█████████▋| 126/130 [08:43<00:10, 2.73s/it] {'loss': 0.5953, 'learning_rate': 9.938449269197181e-08, 'epoch': 4.85} 97%|█████████▋| 126/130 [08:43<00:10, 2.73s/it]dynamic ViT 
batch size: 10, images per sample: 2.5, dynamic token length: 951 [2024-08-28 13:37:31,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 468.83 | bwd_microstep: 860.92 | bwd_inner_microstep: 860.71 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 927 [2024-08-28 13:37:32,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.61 | optimizer_step: 0.59 [2024-08-28 13:37:32,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.57 | bwd_microstep: 856.37 | bwd_inner_microstep: 849.94 | bwd_allreduce_microstep: 6.32 | step_microstep: 9.96 [2024-08-28 13:37:32,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 932.37 | bwd: 1717.33 | bwd_inner: 1710.72 | bwd_allreduce: 6.38 | step: 10.10 98%|█████████▊| 127/130 [08:46<00:08, 2.72s/it] {'loss': 0.5025, 'learning_rate': 5.592405637639742e-08, 'epoch': 4.88} 98%|█████████▊| 127/130 [08:46<00:08, 2.72s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 961 [2024-08-28 13:37:33,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.78 | bwd_microstep: 875.20 | bwd_inner_microstep: 875.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 928 [2024-08-28 13:37:35,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.60 | optimizer_step: 0.57 [2024-08-28 13:37:35,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.94 | bwd_microstep: 855.91 | bwd_inner_microstep: 849.52 | bwd_allreduce_microstep: 6.27 | step_microstep: 9.84 [2024-08-28 13:37:35,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 941.70 | bwd: 1731.15 | bwd_inner: 1724.59 | bwd_allreduce: 6.33 | step: 9.94 98%|█████████▊| 128/130 [08:48<00:05, 2.72s/it] {'loss': 0.4808, 'learning_rate': 2.4861575621553112e-08, 'epoch': 4.92} 98%|█████████▊| 128/130 [08:48<00:05, 2.72s/it]dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 980 [2024-08-28 13:37:36,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.33 | bwd_microstep: 881.19 | bwd_inner_microstep: 880.95 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 10, images per sample: 2.5, dynamic token length: 959 [2024-08-28 13:37:37,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.64 | optimizer_step: 0.59 [2024-08-28 13:37:37,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.77 | bwd_microstep: 872.02 | bwd_inner_microstep: 865.31 | bwd_allreduce_microstep: 6.58 | step_microstep: 10.03 [2024-08-28 13:37:37,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 953.08 | bwd: 1753.24 | bwd_inner: 1746.35 | bwd_allreduce: 6.66 | step: 10.18 99%|█████████▉| 129/130 [08:51<00:02, 2.73s/it] {'loss': 0.463, 'learning_rate': 6.216359983675091e-09, 'epoch': 4.96} 99%|█████████▉| 129/130 [08:51<00:02, 2.73s/it]dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 936  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.  
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-08-28 13:37:39,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.34 | bwd_microstep: 871.27 | bwd_inner_microstep: 871.10 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16
dynamic ViT batch size: 12, images per sample: 3.0, dynamic token length: 936
[2024-08-28 13:37:41,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.64 | optimizer_step: 0.59
[2024-08-28 13:37:41,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.10 | bwd_microstep: 1023.89 | bwd_inner_microstep: 870.67 | bwd_allreduce_microstep: 153.17 | step_microstep: 13.44
[2024-08-28 13:37:41,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 955.41 | bwd: 1895.20 | bwd_inner: 1741.77 | bwd_allreduce: 153.26 | step: 13.59
100%|██████████| 130/130 [08:55<00:00,  3.03s/it]
{'loss': 0.5275, 'learning_rate': 0.0, 'epoch': 5.0}
100%|██████████| 130/130 [08:55<00:00,  3.03s/it]
[INFO|trainer.py:3242] 2024-08-28 13:37:41,666 >> ***** Running Evaluation *****
[INFO|trainer.py:3244] 2024-08-28 13:37:41,666 >>   Num examples = 46
[INFO|trainer.py:3247] 2024-08-28 13:37:41,666 >>   Batch size = 8
[2024-08-28 13:37:43,379] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/nvme0n1/workspace/fengdahu/anaconda3/envs/internvl/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
dynamic ViT batch size: 18, images per sample: 2.25, dynamic token length: 906
  0%|          | 0/3 [00:00<?, ?it/s]
dynamic ViT batch size: 22, images per sample: 2.75, dynamic token length: 936
100%|██████████| 3/3 [00:02<00:00,  1.09s/it]
{'eval_loss': 0.5420305728912354, 'eval_runtime': 21.3644, 'eval_samples_per_second': 2.153, 'eval_steps_per_second': 0.14, 'epoch': 5.0}
100%|██████████| 130/130 [09:16<00:00,  3.03s/it]
100%|██████████| 3/3 [00:02<00:00,  1.09s/it]
[INFO|trainer.py:1962] 2024-08-28 13:38:03,032 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 556.8684, 'train_samples_per_second': 3.735, 'train_steps_per_second': 0.233, 'train_loss': 1.008507801936223, 'epoch': 5.0}
100%|██████████| 130/130 [09:16<00:00,  3.03s/it]
100%|██████████| 130/130 [09:16<00:00,  4.28s/it]
[INFO|trainer.py:2936] 2024-08-28 13:38:11,846 >> Saving model checkpoint to work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui
[INFO|configuration_utils.py:473] 2024-08-28 13:38:11,848 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/config.json
[INFO|configuration_utils.py:594] 2024-08-28 13:38:11,848 >> Configuration saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/generation_config.json
[INFO|modeling_utils.py:2501] 2024-08-28 13:38:18,983 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards.
You can find where each parameters has been saved in the index located at work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-08-28 13:38:18,985 >> tokenizer config file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-08-28 13:38:18,986 >> Special tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-08-28 13:38:18,986 >> added tokens file saved in work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui/added_tokens.json
***** train metrics *****
  epoch                    =        5.0
  train_loss               =     1.0085
  train_runtime            = 0:09:16.86
  train_samples            =        416
  train_samples_per_second =      3.735
  train_steps_per_second   =      0.233
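The final train metrics are consistent with the run visible in this log: 416 training samples over 5.0 epochs in 556.87 s gives the reported 3.735 samples/s, and 130 optimizer steps over the same runtime gives 0.233 steps/s. A minimal arithmetic sketch, using only values copied from the metrics above:

```python
# Cross-check of the "***** train metrics *****" block above.
train_runtime = 556.8684   # seconds (0:09:16.86)
train_samples = 416
epochs = 5.0
total_steps = 130          # from the 130/130 progress bar

print(train_samples * epochs / train_runtime)  # ~3.735 -> train_samples_per_second
print(total_steps / train_runtime)             # ~0.233 -> train_steps_per_second
```

The checkpoint directory written above (work_dirs/internvl_chat_v2_0/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_ui) contains a 4-shard safetensors model plus the tokenizer files, so it should be reloadable with the usual `transformers` `from_pretrained` call (with `trust_remote_code=True`, since InternVL2 ships custom modeling code); that loading step is not part of this log.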