4-bit-quantized gemma-2-27b-it generates only pad tokens, like '<pad><pad><pad><pad><pad><pad><pad><pad><pad>'.

#29
by kshinoda - opened

Thank you for releasing the great models!

I found that this model (gemma-2-27b-it) seems to generate only PAD tokens in my environment when using 4-bit quantization.
My environment and code are as follows.

How should this issue be fixed?
Thanks for your support in advance.

  • torch==2.3.0+cu118
  • transformers==4.42.4
  • bitsandbytes==0.43.1
  • CUDA==11.6

from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

# Load the model in 4-bit with the default BitsAndBytesConfig
kwargs = {'device_map': 'auto'}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True
)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-27b-it', use_fast=False, padding_side='right')

chat = [
    {'role': 'user', 'content': 'Hello!'},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], add_special_tokens=False, padding=True, truncation=True, return_tensors="pt")
inputs = {k: inputs[k].to('cuda') for k in inputs}

outputs = model.generate(**inputs)

tokenizer.decode(outputs[0].cpu().numpy().tolist())

and this is the output:

'<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n<pad><pad><pad><pad><pad><pad><pad><pad><pad>'

Just to add that I'm facing the same issue while using 8-bit quantization.

Same here with 4-bit quantization too.

Hi all. Please use torch_dtype=torch.bfloat16 when loading with from_pretrained(). There's a PR to update the model card examples here: #33.
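For anyone landing here, a minimal sketch of that fix applied to the snippet above. The torch_dtype argument is the change suggested in the reply; passing bnb_4bit_compute_dtype as well is an extra assumption on my part (not confirmed in this thread) to keep the 4-bit compute path in bfloat16 too:

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

kwargs = {'device_map': 'auto'}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: run the 4-bit matmuls in bfloat16 as well
)
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    torch_dtype=torch.bfloat16,  # the suggested fix: avoid the default float16 path
    low_cpu_mem_usage=True,
    **kwargs,
)

With this change, the generation code above should produce a normal model reply instead of pad tokens; the pad-only output is consistent with the model overflowing in float16 and emitting degenerate logits.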
