<pad> spam issue

#40
by Zewsic - opened

I'm trying to run the example code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto", torch_dtype=torch.float16)

chat = [
    { "role": "user", "content": "Write a hello world program on python" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt").to("mps")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))

and I get this output:

Write a hello world program on python<end_of_turn>
<start_of_turn>model
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


Why is this happening?
Google org

That's really odd, can you share exactly what the prompt variable looks like?

I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33

In my experience, loading the model in float32 may work in your case.
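
For concreteness, a minimal sketch of that change, reusing the snippet from the first post (same model ID and device_map; whether there is enough memory to hold the 7B weights in fp32 is an assumption):

import torch
from transformers import AutoModelForCausalLM

# Same checkpoint as above, but loaded in float32 instead of float16
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    torch_dtype=torch.float32,
)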

> That's really odd, can you share exactly what the prompt variable looks like?

<bos><start_of_turn>user
Write a hello world program on python<end_of_turn>
<start_of_turn>model

> I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33
> In my experience, loading the model in float32 may work in your case.

So I tried this, and the result is just nothing. I get this message:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.

And generation simply doesn't happen: nothing is produced and no errors are displayed. The same thing happens with both the base model and the instruction-tuned model; the result is exactly the same.
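
In case it helps with debugging that warning: when a model is loaded with device_map="auto", transformers records the placement in model.hf_device_map, so you can check which modules ended up on disk. A short sketch, reusing the model variable from the snippet above:

# Print where each module was placed; entries mapped to "disk" are the offloaded ones
# that trigger the warning above.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name}: {device}")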

It works correctly with bf16 or fp32, but generates pad tokens when using fp16. I'd like to know why.
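
For anyone who just wants a working setup, a minimal sketch of the bf16 path reported to work above (the max_new_tokens value and skip_special_tokens flag are my choices, not from this thread):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # bf16 instead of the fp16 setting reported as broken
)

chat = [{"role": "user", "content": "Write a hello world program in Python"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# The chat template already inserts <bos>, so don't add special tokens again here.
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))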

Google org

Do you only see this with the 7B IT model and not any other model?

Hi @suryabhupa

I've got similar errors: the 2B-it model works pretty well with all precision options, but the 7B-it only works fine under bfloat16. With float16, 8-bit, and 4-bit, when dealing with long inputs the model freezes for a couple of minutes, then repeats the input and generates lots of <pad>.
P.S. The experiments are running on a server with a Tesla A100, so I don't think this is triggered by the hardware.
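
For anyone comparing, a rough sketch of what the 4-bit load looks like via the standard bitsandbytes path (the bf16 compute dtype here is my assumption, not something from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized load; for 8-bit use BitsAndBytesConfig(load_in_8bit=True) instead
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 rather than fp16
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    quantization_config=quant_config,
)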

I have the same problem, did you find any solution?

Google org

That's quite bizarre. I'm curious whether you find this happens on both the PyTorch and JAX codepaths? Just trying to diagnose where the issue might be coming from.

I have the same problem. If I switch to CPU it works well, but on GPU it doesn't. I have 2 GPUs and am running the Hugging Face examples.

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

The output looks like this:
<bos>Write me a poem about Machine Learning.<pad><pad>...

A single GPU works:
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="cuda:0", torch_dtype=torch.bfloat16)
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda:0")
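
For completeness, a self-contained version of that single-GPU setup (a sketch: the max_new_tokens value and skip_special_tokens flag are my additions, and it assumes one CUDA device with enough memory for the 7B weights in bf16):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="cuda:0",          # pin the whole model to one GPU instead of device_map="auto"
    torch_dtype=torch.bfloat16,   # bf16 rather than fp16
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda:0")
outputs = model.generate(**input_ids, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))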

Hello @suryabhupa,
continuous <pad> padding occurs with the gemma-2b-it model too.
I guess it is related to a hardware issue, because the same dtype behaves differently on different hardware.
I've attached screenshots; I hope they help.

Model: gemma-1.1-2b-it

Experiments
Case 1) CPU + float16 -> works well
Case 2) MPS + float16 -> continuous <pad> padding occurs
Case 3) CUDA + float16 -> works well

Device specs

  • CPU: MacBook Air M3
  • MPS: same machine as CPU
  • CUDA: L4 (Google Colab)
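
In case it's useful, a rough sketch of how the comparison above could be scripted (the prompt and max_new_tokens are arbitrary choices of mine):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-1.1-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Run the same fp16 prompt on whichever backends are available on this machine.
devices = ["cpu"]
if torch.backends.mps.is_available():
    devices.append("mps")
if torch.cuda.is_available():
    devices.append("cuda")

for device in devices:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
    inputs = tokenizer("Write a hello world program in python", return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(device, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
    del model  # release memory before loading on the next device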

Screenshots (attached): CUDA+FP16.png, MPS+FP16.png, CPU+FP16.png
