<pad> spam issue

#40
by Zewsic - opened

I'm trying to run the example code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto", torch_dtype=torch.float16)

chat = [
    { "role": "user", "content": "Write a hello world program on python" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt").to("mps")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))

and I get this output:

Write a hello world program on python<end_of_turn>
<start_of_turn>model
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


Why is this happening?
Google org

That's really odd, can you share exactly what the prompt variable looks like?

I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33

In my experience, loading the model in float32 may work in your case.
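
For concreteness, a minimal sketch of that change, reusing the snippet from the first post (same model ID and device_map; whether there is enough memory to hold the 7B weights in fp32 is an assumption):

import torch
from transformers import AutoModelForCausalLM

# Same checkpoint as above, but loaded in float32 instead of float16
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    torch_dtype=torch.float32,
)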

> That's really odd, can you share exactly what the prompt variable looks like?

<bos><start_of_turn>user
Write a hello world program on python<end_of_turn>
<start_of_turn>model

> I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33
> In my experience, loading the model in float32 may work in your case.

So I tried this, and the result is just nothing. I get this message:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.

And generation simply doesn't happen: nothing is produced and no errors are displayed. The same thing happens with both the base model and the instruction-tuned model; the result is exactly the same.
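
In case it helps with debugging that warning: when a model is loaded with device_map="auto", transformers records the placement in model.hf_device_map, so you can check which modules ended up on disk. A short sketch, reusing the model variable from the snippet above:

# Print where each module was placed; entries mapped to "disk" are the offloaded ones
# that trigger the warning above.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name}: {device}")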

It works correctly with bf16 or fp32, but generates pad tokens when using fp16. I'd like to know why.
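
For anyone who just wants a working setup, a minimal sketch of the bf16 path reported to work above (the max_new_tokens value and skip_special_tokens flag are my choices, not from this thread):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # bf16 instead of the fp16 setting reported as broken
)

chat = [{"role": "user", "content": "Write a hello world program in Python"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# The chat template already inserts <bos>, so don't add special tokens again here.
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))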

Google org

Do you only see this with the 7B IT model and not any other model?

Hi @suryabhupa

I've got similar errors: the 2B-it model works pretty well with all precision options, but the 7B-it only works fine under bfloat16. With float16, 8-bit, and 4-bit, when dealing with long inputs the model freezes for a couple of minutes, then repeats the input and generates lots of <pad>.
P.S. The experiments are running on a server with a Tesla A100, so I don't think this is triggered by the hardware.
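
For anyone comparing, a rough sketch of what the 4-bit load looks like via the standard bitsandbytes path (the bf16 compute dtype here is my assumption, not something from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized load; for 8-bit use BitsAndBytesConfig(load_in_8bit=True) instead
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 rather than fp16
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="auto",
    quantization_config=quant_config,
)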

I have the same problem, did you find any solution?

Google org

That's quite bizarre. I'm curious whether you find this happens on both the PyTorch and JAX codepaths? Just trying to diagnose where the issue might be coming from.

I have the same problem. If I switch to CPU it works well, but on GPU it doesn't. I have 2 GPUs and am running the Hugging Face examples.

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

The output looks like this:
<bos>Write me a poem about Machine Learning.<pad><pad>...

A single GPU works:
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="cuda:0", torch_dtype=torch.bfloat16)
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda:0")
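
For completeness, a self-contained version of that single-GPU setup (a sketch: the max_new_tokens value and skip_special_tokens flag are my additions, and it assumes one CUDA device with enough memory for the 7B weights in bf16):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    device_map="cuda:0",          # pin the whole model to one GPU instead of device_map="auto"
    torch_dtype=torch.bfloat16,   # bf16 rather than fp16
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda:0")
outputs = model.generate(**input_ids, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))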

Hello @suryabhupa,
continuous <pad> padding occurs with the gemma-2b-it model too.
I guess it is related to a hardware issue, because the same dtype behaves differently on different hardware.
I've attached screenshots; I hope they help.

Model: gemma-1.1-2b-it

Experiments
Case 1) CPU + float16 -> works well
Case 2) MPS + float16 -> continuous <pad> padding occurs
Case 3) CUDA + float16 -> works well

Device specs

  • CPU: MacBook Air M3
  • MPS: same machine as CPU
  • CUDA: L4 (Google Colab)
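
In case it's useful, a rough sketch of how the comparison above could be scripted (the prompt and max_new_tokens are arbitrary choices of mine):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-1.1-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Run the same fp16 prompt on whichever backends are available on this machine.
devices = ["cpu"]
if torch.backends.mps.is_available():
    devices.append("mps")
if torch.cuda.is_available():
    devices.append("cuda")

for device in devices:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
    inputs = tokenizer("Write a hello world program in python", return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(device, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
    del model  # release memory before loading on the next device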

Screenshots (attached): CUDA+FP16.png, MPS+FP16.png, CPU+FP16.png
