model.generate is throwing AttributeError: 'HybridCache' object has no attribute 'float'

#18
by saireddy

I am using transformers 4.42.3 and trying to full fine-tune the gemma2-9B model with supervised fine-tuning. When I run
outputs = model.generate(input_ids=input_ids, max_new_tokens=150,
                         do_sample=True, top_p=0.6, temperature=0.5,
                         pad_token_id=tokenizer.eos_token_id)
it throws "AttributeError: 'HybridCache' object has no attribute 'float'", but trainer.train() didn't throw any errors.

I got a similar error with torch 2.0.1 and CUDA 11.7.

When I tried torch 2.1.1 and CUDA 12.1, it worked.

I am using the nvcr.io/nvidia/pytorch:24.05-py3 image, which has PyTorch 2.4.0a0+07cecf4 and NVIDIA CUDA 12.4.1, and I am still hitting the same issue.

model.generate(
    **tokenizer("### Q: What is the capital city of the Korea?", return_tensors='pt', return_token_type_ids=False).to('cuda'),
    do_sample=True,
    use_cache=False,  # disable the KV cache so the HybridCache path is never taken
    max_new_tokens=256,
    eos_token_id=1,
)

A simple workaround for now: pass use_cache=False.
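For reference, here is the same flag applied to the generate call from the first post (a sketch, assuming input_ids and tokenizer are defined as in that snippet):

outputs = model.generate(input_ids=input_ids, max_new_tokens=150,
                         do_sample=True, top_p=0.6, temperature=0.5,
                         use_cache=False,  # skip the HybridCache path entirely
                         pad_token_id=tokenizer.eos_token_id)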

Thanks for the trick @beomi. Now I am hitting another issue:
RuntimeError: Index put requires the source and destination dtypes match, got BFloat16 for the destination and Float for the source.
but I don't run into this issue with other LLMs, so I am not sure what's wrong here. Any advice?
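One thing worth checking (a hedged diagnostic sketch, not a confirmed diagnosis): prepare_model_for_kbit_training upcasts the non-quantized parameters (layer norms, etc.) to float32 for training stability, which can leave parts of the model producing float32 activations while the generation cache is pre-allocated in bfloat16. Listing the float32 parameters shows whether that applies here:

import torch

# list parameters that are float32 on an otherwise bf16 model; these are the
# likely source of the float32 key/value states hitting the bf16 cache
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        print(name, tuple(param.shape))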

Here is the base code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# BitsAndBytesConfig int-4 config

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token,
)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
# prepare model for training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy='epoch',
    disable_tqdm=False,  # with packing, tqdm progress values can be incorrect
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    run_name=run_name,
)
from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # applies the prompt mapping to all training and test examples
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

@saireddy Could you confirm your Transformers/TRL/Accelerate versions and your PyTorch/CUDA versions?

Hey @beomi
I am using the nvcr.io/nvidia/pytorch:24.05-py3 image, which has PyTorch 2.4.0a0+07cecf4 and NVIDIA CUDA 12.4.1.

accelerate==0.31.0
bitsandbytes==0.43.1
deepspeed==0.14.4
evaluate==0.4.1
peft==0.11.1
transformers==4.42.3
trl==0.9.4

Stack trace:

outputs = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens, do_sample=True,
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1491, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1914, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2651, in _sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 1068, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 908, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 650, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 252, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 1071, in update
    return update_fn(
  File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 1046, in _static_update
    k_out[:, :, cache_position] = key_states
RuntimeError: Index put requires the source and destination dtypes match, got BFloat16 for the destination and Float for the source.
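Reading the bottom of the trace, cache_utils._static_update is writing freshly computed float32 key_states into a cache buffer that was pre-allocated in bfloat16. A possible workaround, sketched under the assumption that the float32 values come from the parameters upcast by prepare_model_for_kbit_training, is to cast those parameters back to bfloat16 before calling generate (this partially undoes the training-stability upcast, so only do it for inference after training is finished):

import torch

# cast float32 (upcast) parameters back to bfloat16 so the computed key/value
# states match the dtype of the pre-allocated HybridCache during generation
for param in model.parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.bfloat16)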
