8-bit precision error

#32
by saireddy - opened

Is anyone else facing this issue with multiple A100 GPUs?
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I am using "auto" for the device map and am still hitting this issue.

can you provide the training code?

model_id = "google/gemma-7b"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

print("initiating model download")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy='epoch',
    disable_tqdm=False,  # disable tqdm since with packing the progress values are incorrect
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name
)

from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"]
)
trainer.train()

A few requirement versions that might help:
accelerate==0.27.2
transformers==4.38.1
trl==0.7.11
bitsandbytes==0.42.0

Hi @saireddy !

This is because you need to make sure the model fits entirely on a single GPU instead of being split across all available devices. Can you try:

+ from accelerate import PartialState

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
-    device_map="auto", 
+    device_map={"": PartialState().process_index},
    token=access_token
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy='epoch',
    disable_tqdm=False,  # disable tqdm since with packing the progress values are incorrect
    load_best_model_at_end=True,    
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name
)
from trl import SFTTrainer

max_seq_length = 2048 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"]
)
trainer.train()

The line device_map={"": PartialState().process_index} makes sure Accelerate force-sets the whole model onto the GPU whose index matches the local process, instead of splitting it across multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
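
For reference, here is a minimal sketch (not part of the original reply) of what that placement does under a multi-GPU launcher such as accelerate launch or torchrun; the checkpoint and quantization settings are taken from the snippet above:

# Minimal sketch: under DDP-style launching, each process pins the *whole*
# quantized model to its own GPU instead of sharding it across devices.
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

state = PartialState()  # reads the distributed environment set up by the launcher
print(f"process {state.process_index} -> {state.device}")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",                     # checkpoint from this thread; gated, requires a token
    quantization_config=bnb_config,
    device_map={"": state.process_index},  # whole model on this process's GPU, no splitting
)

Note that with this placement every GPU holds a full copy of the model, so per-GPU memory use is higher than with device_map="auto"; that trade-off is what several replies below run into as OOM errors.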

Thanks @ybelkada . A couple of follow-up questions:
I am using the same template for other models like Llama 2 and it doesn't show the same error. Also, I am using 4-bit quantization, but the error message talks about 8-bit precision. Am I missing something here? Can you please share your thoughts on this?

I tried the changes you mentioned; surprisingly, I am now hitting an out-of-memory issue even on an 80GB A100.

Same here. I tried the suggestion, and it is running out of memory even with 4 A10 GPUs.

I am facing the same issue.

CODE and ERROR BELOW:

trainer.train()

File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1331, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

  • device_map={"": PartialState().process_index}

The above change gives a CUDA OOM issue. I am working with Gemma 2B-it.


I also tried the suggestion and I am having the same problem with the Mistral 7B model that I am trying to fine-tune.

My code is:

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit quantization
    bnb_4bit_quant_type='nf4',              # information-theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant=True,         # quantize the quantized weights // insert xzibit meme
    bnb_4bit_compute_dtype=torch.bfloat16   # optimized fp format for ML
)

lora_config = LoraConfig(
    r=16,                                   # the dimension of the low-rank matrices
    lora_alpha=8,                           # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,                      # dropout probability of the LoRA layers
    bias='none',                            # whether to train bias weights; set to 'none' for attention layers
    task_type='SEQ_CLS'
)

from accelerate import PartialState

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=len(le.classes_),
    device_map={"": PartialState().process_index},
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.pad_token_id = tokenizer.pad_token_id

# define training args
training_args = TrainingArguments(
    output_dir='multilabel_classification',
    learning_rate=1e-4,
    per_device_train_batch_size=8,          # tested with 16 GB of GPU RAM
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

label_weights = 1 - labels_encoded / labels_encoded.sum()

# train
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['val'],
    tokenizer=tokenizer,
    data_collator=functools.partial(collate_fn, tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    label_weights=torch.tensor(label_weights, device=model.device)
)

trainer.train()

deleted

BTW I am getting the original error mentioned - ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I0000 00:00:1711804411.121734 181893 cpu_client.cc:373] TfrtCpuClient destroyed.

I tried with the new Gemma 7B-it model and am still hitting the same issue.

Hi everyone,

I encountered a similar error with Llama 3 7B. To address this issue, I tried the following suggestion:

The line device_map={"": PartialState().process_index} makes sure Accelerate force-sets the whole model onto the GPU whose index matches the local process, instead of splitting it across multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994

While this helped resolve the initial error, it has now led to an out-of-memory (OOM) issue. I find this somewhat unexpected, considering I am using 8 NVIDIA A100 GPUs (40GB each) and have never hit OOM errors with this configuration when working with models of similar size. I am using QLoRA for the fine-tuning.

Following are the versions of the libraries that I am using:
transformers==4.40.1
accelerate==0.30.0.dev0
trl==0.8.6
peft==0.10.0

I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.

Here are my LoRA params:

  1. Rank = 8
  2. Alpha = 8
  3. Using 4bit quantization while loading the base model.
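
For concreteness, a rough sketch (not the poster's actual code) of a 4-bit QLoRA setup matching the params above (rank 8, alpha 8, 4-bit base, single-device placement); the checkpoint name, target modules, and dropout are assumptions, since they were not shared:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",         # assumed checkpoint; the post only says "llama3"
    quantization_config=bnb_config,
    device_map={"": 0},                   # single-device placement, as tried above
)

lora_config = LoraConfig(
    r=8,                                  # Rank = 8
    lora_alpha=8,                         # Alpha = 8
    target_modules=["q_proj", "v_proj"],  # assumption: typical attention projections
    lora_dropout=0.05,                    # assumption
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)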

Has anyone figured out a solution to this problem? Thanks in advance!


I got the same error and hit OOM after the change.
My training code works for Llama 2 but not Llama 3 on 4x A40.
Have you figured out a solution?

This is how I solved my issue:
Before running my script, I ran the command below. In my case, I wanted to use either GPU 3, 4, or 5 (the other GPUs were heavily loaded by other users).

export CUDA_VISIBLE_DEVICES=3,4,5

Inside my Python script, I used these commands:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

desired_device = 0
torch.cuda.set_device(desired_device)
............

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
    device_map=torch.cuda.set_device(desired_device)  # note: set_device() returns None, so this is equivalent to passing device_map=None
)

....................

My guess is that the system now considers GPU 3 as GPU 0 (the default GPU) because of the "export CUDA_VISIBLE_DEVICES=3,4,5" command. After running the export command, I checked the available GPUs and the system gave me this output:

Number of available GPUs: 3
GPU 0: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 1: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 2: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
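
For reference, a listing like the one above can be produced with something along these lines (a sketch; the poster did not share the exact code):

import torch

print(f"Number of available GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}")
    print(f"  Compute Capability: {(props.major, props.minor)}")
    print(f"  Memory: {props.total_memory / 1024**3:.2f} GB")

With CUDA_VISIBLE_DEVICES=3,4,5 exported, PyTorch re-indexes the visible GPUs as 0, 1, and 2, which is why torch.cuda.set_device(0) above targets physical GPU 3.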

Google org

Hi @saireddy , could you please confirm whether this issue is resolved or you are still facing it? Thank you.

The solution above worked for me, thanks @Renu11 .

saireddy changed discussion status to closed

@saireddy which solution above worked for you? I had a similar error, but none of the suggestions above worked for me. Thanks.

@ctsang @Renu11 Reducing the batch size worked for me, and here are the versions I used:
accelerate==0.31.0
bitsandbytes==0.43.1
datasets==2.18.0
deepspeed==0.14.4
evaluate==0.4.1
peft==0.11.1
transformers==4.42.3
trl==0.9.4
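
If simply lowering the batch size hurts convergence, one common pattern (a sketch, not from the code shared earlier in this thread) is to shrink per_device_train_batch_size while raising gradient_accumulation_steps so the effective batch size stays the same:

from transformers import TrainingArguments

# Effective per-GPU batch stays at 16 (2 x 8), but peak activation memory drops,
# since only 2 samples are in flight at any time.
args = TrainingArguments(
    output_dir="output",              # placeholder path
    per_device_train_batch_size=2,    # was 8 in the scripts above
    gradient_accumulation_steps=8,    # was 2
    gradient_checkpointing=True,      # trades recompute for memory
    bf16=True,
    optim="paged_adamw_32bit",        # paged optimizer helps absorb memory spikes
)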
