4bit-quantized gemma-2-27b-it generates only pad tokens, like '<pad><pad><pad><pad><pad><pad><pad><pad><pad>'.
#29 opened by kshinoda
Thank you for releasing the great models!
I found that this model (gemma-2-27b-it) seems to generate only PAD tokens in my environment when using 4-bit quantization.
My environment and code are as follows.
How should this issue be fixed?
Thanks for your support in advance.
- torch==2.3.0+cu118
- transformers==4.42.4
- bitsandbytes==0.43.1
- CUDA==11.6
```python
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

kwargs = {'device_map': 'auto'}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True
)

model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-27b-it', use_fast=False, padding_side='right')

chat = [
    {'role': 'user', 'content': 'Hello!'},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], add_special_tokens=False, padding=True, truncation=True, return_tensors="pt")
inputs = {k: inputs[k].to('cuda') for k in inputs}

outputs = model.generate(**inputs)
tokenizer.decode(outputs[0].cpu().numpy().tolist())
```
This is the output:

```
'<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n<pad><pad><pad><pad><pad><pad><pad><pad><pad>'
```
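For reference, one change that is often suggested for Gemma-2 with bitsandbytes 4-bit loading is to pass an explicit bfloat16 compute dtype and load the model weights in bfloat16. Whether this resolves the pad-token-only output here is an assumption, not something verified in this thread; a minimal sketch of that variant:

```python
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

# Sketch only: same 4-bit load as above, but with an explicit bfloat16 compute dtype.
# Assumption: this configuration avoids the pad-token-only output; not verified here.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    low_cpu_mem_usage=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
```

Tokenization and generation would then proceed exactly as in the snippet above.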
Just to add that I'm facing the same issue while using 8-bit quantization.
Same here with 4-bit quantization too.