diff --git "a/stderr.log" "b/stderr.log" new file mode 100644--- /dev/null +++ "b/stderr.log" @@ -0,0 +1,78 @@ ++ deepspeed --num_nodes=1 --num_gpus=4 --master_port 56337 --module safe_rlhf.finetune --train_datasets alpaca --model_name_or_path huggyllama/llama-7b --max_length 512 --trust_remote_code True --epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16 --gradient_checkpointing --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.0 --seed 42 --output_dir /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/sft --log_type wandb --log_project Safe-RLHF-SFT --zero_stage 3 --bf16 True --tf32 True +2023-12-31 20:07:06.942109: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered +2023-12-31 20:07:06.942153: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered +2023-12-31 20:07:06.943372: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered +2023-12-31 20:07:07.014180: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered +2023-12-31 20:07:07.014285: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered +2023-12-31 20:07:07.015279: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered +2023-12-31 20:07:07.017820: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered +2023-12-31 20:07:07.017846: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered +2023-12-31 20:07:07.018518: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered +2023-12-31 20:07:07.073231: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered +2023-12-31 20:07:07.073268: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered +2023-12-31 20:07:07.074177: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered +2023-12-31 20:07:08.102141: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT +2023-12-31 20:07:08.104105: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT +2023-12-31 20:07:08.108165: W 
+2023-12-31 20:07:08.108457: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+ Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+Using pad_token, but it is not set yet.
+Using pad_token, but it is not set yet.
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+Using pad_token, but it is not set yet.
+ Loading checkpoint shards: 50%|█████ | 1/2 [00:08<00:08, 8.76s/it] Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 5.52s/it] Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.01s/it]
+You are using the legacy behaviour of the . This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
+Using pad_token, but it is not set yet.
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Using /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
+Detected CUDA files, patching ldflags
+Emitting ninja build file /data/jiongxiao_wang/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
+Building extension module fused_adam...
+Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+wandb: Currently logged in as: jayfeather (jayfeather1024). Use `wandb login --relogin` to force relogin
+wandb: Tracking run with wandb version 0.16.1
+wandb: Run data is saved locally in /data/jiongxiao_wang/rlhf_attack/safe-rlhf/output/sft/wandb/run-20231231_200741-owu4dq7j
+wandb: Run `wandb offline` to turn off syncing.
+wandb: Syncing run sft-2023-12-31-20-07-40
+wandb: ⭐️ View project at https://wandb.ai/jayfeather1024/Safe-RLHF-SFT
+wandb: 🚀 View run at https://wandb.ai/jayfeather1024/Safe-RLHF-SFT/runs/owu4dq7j
+ Training 1/3 epoch: 0%| | 0/9753 [00:00