[2024-12-04 14:10:38,207] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:39,754] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-12-04 14:10:39,754] [INFO] [runner.py:571:main] cmd = /vol3/ctr/.conda/envs/llava_rest/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNV19 --master_addr=127.0.0.1 --master_port=29504 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /vol3/home/ctr/llava-rlhf/LLaVA-REST-MCTS/models/LLaVA/scripts/zero3_offload.json --model_name_or_path /vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b --version v1 --data_path /vol3/home/ctr/llava-rlhf/datasets/aokvqa/aokvqa_policy_train.json --image_folder /vol3/home/ctr/llava-rlhf/datasets/coco --vision_tower /vol3/home/ctr/llava-rlhf/models/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir /vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-policy-v2 --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 8 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 3 --learning_rate 5e-6 --weight_decay 0.05 --warmup_ratio 0.1 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 8 --lazy_preprocess True --report_to wandb
[2024-12-04 14:10:42,757] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:44,308] [INFO] [launch.py:138:main] 0 NCCL_TIMEOUT=360
[2024-12-04 14:10:44,308] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=360
[2024-12-04 14:10:44,308] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}
[2024-12-04 14:10:44,308] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=6, node_rank=0
[2024-12-04 14:10:44,308] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5]})
[2024-12-04 14:10:44,308] [INFO] [launch.py:163:main] dist_world_size=6
[2024-12-04 14:10:44,308] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
[2024-12-04 14:10:48,071] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:48,370] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:48,394] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:48,408] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:48,450] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:48,473] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-04 14:10:49,624] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-12-04 14:10:49,895] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-12-04 14:10:49,895] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-04 14:10:49,935] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-12-04 14:10:49,974] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-12-04 14:10:50,051] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-12-04 14:10:50,072] [INFO] [comm.py:637:init_distributed] cdb=None
model_args: ModelArguments(model_name_or_path='/vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b', version='v1', freeze_backbone=False, tune_mm_mlp_adapter=False, vision_tower='/vol3/home/ctr/llava-rlhf/models/clip-vit-large-patch14-336', mm_vision_select_layer=-2, pretrain_mm_mlp_adapter=None, mm_projector_type='mlp2x_gelu', mm_use_im_start_end=False, mm_use_im_patch_token=False, mm_patch_merge_type='flat', mm_vision_select_feature='patch')
data_args: DataArguments(data_path='/vol3/home/ctr/llava-rlhf/datasets/aokvqa/aokvqa_policy_train.json', lazy_preprocess=True, is_multimodal=False, image_folder='/vol3/home/ctr/llava-rlhf/datasets/coco', image_aspect_ratio='pad')
training_args: TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
bits=16,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=8,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/vol3/home/ctr/llava-rlhf/LLaVA-REST-MCTS/models/LLaVA/scripts/zero3_offload.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
double_quant=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
freeze_mm_mlp_adapter=False,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
group_by_modality_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-policy-v2/runs/Dec04_14-10-49_a102,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lora_alpha=16,
lora_bias=none,
lora_dropout=0.05,
lora_enable=False,
lora_r=64,
lora_weight_path=,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mm_projector_lr=None,
model_max_length=2048,
mp_parameters=,
mpt_attn_impl=triton,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=/vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-policy-v2,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
quant_type=nf4,
ray_scope=last,
remove_unused_columns=False,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=/vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-policy-v2,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=100,
save_strategy=steps,
save_total_limit=3,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.05,
)
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-12-04 14:11:17,198] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 295, num_elems = 6.76B
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
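
Note on the twelve interleaved warnings above: they are emitted once per rank (6 GPUs, two warnings each). The "model of type llava" notice comes from LLaVA instantiating its custom LlavaLlamaForCausalLM class from a llava-typed config, and the Flash Attention 2.0 notice fires because the weights start out on CPU, which is expected here since ZeRO-3 shards and offloads parameters at init. What follows is a minimal sketch of the load-on-CPU-then-move pattern the warning itself describes, not the training script's actual loading code; it assumes the LLaVA repo is installed (providing the llava.model import path) and a transformers version that accepts the attn_implementation keyword. The checkpoint path and bf16 dtype are copied from the launch command above.

import torch
from llava.model import LlavaLlamaForCausalLM  # assumed import path from the LLaVA repo

# Initialize on CPU first -- the state the warning complains about ...
model = LlavaLlamaForCausalLM.from_pretrained(
    "/vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b",  # --model_name_or_path from the cmd above
    torch_dtype=torch.bfloat16,                        # matches --bf16 True
    attn_implementation="flash_attention_2",           # what triggers the CPU-init warning
)
# ... then move it to the GPU, which is exactly the remedy the warning quotes.
model.to("cuda")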