8-bit precision error
Is anyone facing this issue with multiple A100 GPUs?
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I am using "auto" for device map, still hitting this issue
Can you provide the training code?
model_id = "google/gemma-7b"
BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
print("initiating model download")
model = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=bnb_config,
use_cache=False,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
device_map="auto", token=access_token)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",
    disable_tqdm=False,  # keep tqdm enabled; note that with packing the reported values can be inaccurate
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,
)
from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
A few package versions that might help:
accelerate==0.27.2
transformers==4.38.1
trl==0.7.11
bitsandbytes==0.42.0
Hi @saireddy !
This is because you need to make sure your model fits entirely on a single GPU instead of being split across all available devices. Can you try:
+ from accelerate import PartialState

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
-   device_map="auto",
+   device_map={"": PartialState().process_index},
    token=access_token,
)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",
    disable_tqdm=False,  # keep tqdm enabled; note that with packing the reported values can be inaccurate
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,
)
from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
The line device_map={"": PartialState().process_index} will make sure accelerate force-sets the model onto the GPU of index i (the process index) instead of splitting it across multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
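As a minimal sketch (assuming the script is launched with accelerate launch or torchrun, so each training process knows its own rank), here is what that mapping resolves to per process:

from accelerate import PartialState

state = PartialState()
# On a 2-GPU run, process 0 builds {"": 0} and process 1 builds {"": 1},
# so each rank loads the full quantized model onto its own GPU
# instead of sharding it across devices the way device_map="auto" does.
device_map = {"": state.process_index}
print(f"rank {state.process_index} -> cuda:{state.process_index}")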
Thanks @ybelkada. A couple of follow-up questions:
1. I am using the same template for other models like Llama 2 and it doesn't show the same error. Also, I am using 4-bit quantization, but the error talks about 8-bit precision. Am I missing something here? Can you please share your thoughts on this?
2. I tried the changes you mentioned; surprisingly, I am hitting an out-of-memory issue even on an 80GB A100.
Same here, I tried the suggestion. It's running out of memory even with 4 A10 GPUs.
I am facing the same issue.
Code and error below:
trainer.train()
File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1331, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
The suggested change, device_map={"": PartialState().process_index}, gives a CUDA OOM issue. I am working with Gemma-2b-it.
I also tried the suggestion and I am having the same problem with the Mistral 7B model I am trying to fine-tune.
My code is:
import functools
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # information-theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant=True,         # quantize the quantized weights // insert xzibit meme
    bnb_4bit_compute_dtype=torch.bfloat16,  # optimized fp format for ML
)
lora_config = LoraConfig(
    r=16,               # the dimension of the low-rank matrices
    lora_alpha=8,       # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # dropout probability of the LoRA layers
    bias="none",        # whether to train bias weights; set to 'none' for attention layers
    task_type="SEQ_CLS",
)
from accelerate import PartialState

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=len(le.classes_),
    device_map={"": PartialState().process_index},
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.pad_token_id = tokenizer.pad_token_id
# define training args
training_args = TrainingArguments(
    output_dir="multilabel_classification",
    learning_rate=1e-4,
    per_device_train_batch_size=8,  # tested with 16 GB of GPU RAM
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
label_weights = 1 - labels_encoded/labels_encoded.sum()
# train
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["val"],
    tokenizer=tokenizer,
    data_collator=functools.partial(collate_fn, tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    label_weights=torch.tensor(label_weights, device=model.device),
)
trainer.train()
BTW, I am getting the original error mentioned above: ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I0000 00:00:1711804411.121734 181893 cpu_client.cc:373] TfrtCpuClient destroyed.
I tried with the new Gemma 7B IT model and am still hitting the same issue.
Hi everyone,
I encountered a similar error with Llama 3 7B. To address this issue, I tried the following suggestion:
The line device_map={"": PartialState().process_index} will make sure accelerate force-sets the model onto the GPU of index i instead of splitting it across multiple devices. See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
While this helped resolve the initial error, it has now led to an out-of-memory (OOM) issue. I find this somewhat unexpected, considering I am using 8 NVIDIA A100 GPUs (each with 40 GB of memory) and have never experienced OOM errors with this configuration when working with models of similar size. I am performing QLoRA fine-tuning.
Following are the versions of the libraries I am using:
transformers==4.40.1
accelerate==0.30.0.dev0
trl==0.8.6
peft==0.10.0
I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.
Here are my LoRA params:
- Rank = 8
- Alpha = 8
- Using 4bit quantization while loading the base model.
Has anyone figured out a solution to this problem? Thanks in advance!
I got the same error and am hitting OOM after the change.
My training code works for Llama 2 but not Llama 3 on 4x A40.
Have you figured out a solution?
This is how I solved my issue:
Before running my script, I ran the command below. In my case, I wanted to use GPU 3, 4, or 5 (the other GPUs were heavily loaded by other users):
export CUDA_VISIBLE_DEVICES=3,4,5
Inside my Python script, I used these commands:
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set as an environment variable so it actually takes effect

desired_device = 0
torch.cuda.set_device(desired_device)
............
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
    device_map={"": desired_device},  # torch.cuda.set_device() returns None, so pass the device index explicitly
)
....................
My guess is that the system now treats GPU 3 as GPU 0 (the default GPU) because of the "export CUDA_VISIBLE_DEVICES=3,4,5" command. After running the export, I checked the available GPUs and the system gave me this output:
Number of available GPUs: 3
GPU 0: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 1: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 2: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
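For reference, a quick check along these lines (a hypothetical snippet, not the poster's exact script) produces that kind of listing once CUDA_VISIBLE_DEVICES is set:

import torch

# With CUDA_VISIBLE_DEVICES=3,4,5 the process only sees three devices,
# re-indexed as cuda:0, cuda:1, and cuda:2.
print(f"Number of available GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}")
    print(f"  Compute Capability: ({props.major}, {props.minor})")
    print(f"  Memory: {props.total_memory / 1024**3:.2f} GB")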