Warning: The attention mask and the pad token id were not set.
Hi, when I run inference with llama3-8b-instruct, I get this warning:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Although it doesn't impact the result, I still want to fix the warning. Any idea?
Thanks for answering.
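For reference, this is roughly the kind of call that triggers it. This is a simplified sketch, not my exact code; the model id and prompt are just placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# encode() only produces input_ids; generate() then has neither an
# attention_mask nor a pad_token_id, which is what the warning is about.
input_ids = tokenizer.encode(
    "What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))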
When calling model.generate, setting pad_token_id=tokenizer.eos_token_id seems to remove the warning.
It still seems to work fine with that, although some explanation and reasoning wouldn't hurt. Maybe I saw one when searching for the solution, but I forget. :)
Just changing the call to model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id) should be enough.
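In context, that looks something like this (a minimal sketch; the prompt and max_new_tokens are placeholders):

# encoded_input is assumed to come from calling the tokenizer directly,
# so it already carries the attention_mask alongside the input_ids.
encoded_input = tokenizer(
    "What is the capital of France?", return_tensors="pt").to(model.device)

# Llama 3 ships without a dedicated pad token, so the eos token id is reused
# for padding, which is the same fallback the warning describes.
output = model.generate(
    **encoded_input,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))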
Hey there, I was also facing the same warning, and after tinkering with the tokenizer for a bit I found out that inputs = tokenizer.encode(prompt) returns just the input_ids but not the attention mask, whereas inputs = tokenizer(prompt) returns both the input_ids and the attention mask.
So if you replace this code

inputs = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(inputs)

with this

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs)

the warning goes away (generate() expects tensors rather than plain Python lists, hence the added return_tensors="pt"). Hope that helps 😊.
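To see the difference directly, you can inspect what each call returns (a quick sketch; the prompt is just a placeholder):

prompt = "Hello, how are you?"

# encode() returns only the token ids, with no attention mask
ids_only = tokenizer.encode(prompt)
print(type(ids_only))  # <class 'list'>

# calling the tokenizer returns a BatchEncoding with both pieces generate() wants
both = tokenizer(prompt)
print(list(both.keys()))  # ['input_ids', 'attention_mask'] for the Llama 3 tokenizer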
I had the same warning as well; it took quite a bit of digging through the Hugging Face transformers code, but I was able to come to a solution:
original warning: The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
If you use apply_chat_template() with your tokenizer (an instance of PreTrainedTokenizer, for instance), then set return_dict=True, e.g.
return_output = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True).to(model.device)
I do the to(model.device) at the end to move those PyTorch tensors onto the model's device (e.g. "cuda").
return_output is a dict with 2 keys, "input_ids" and "attention_mask". Use both when you run generate(), for example:
return model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=generation_configuration.max_new_tokens,
    do_sample=do_sample,
    top_k=generation_configuration.top_k,
    top_p=generation_configuration.top_p,
    temperature=generation_configuration.temperature,
    eos_token_id=eos_token_id,
    streamer=streamer)
and that gets called in my function:
with torch.no_grad():
    generate_output = run_model_generate(
        input_ids=return_output["input_ids"],
        model=model,
        streamer=streamer,
        eos_token_id=generation_configuration.eos_token_id,
        generation_configuration=generation_configuration,
        attention_mask=return_output["attention_mask"])
My code here: https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreTransformers/executable_scripts/terminal_only_infinite_loop_instruct.py (I'm trying to build out my own library so I'm calling a number of wrappers I made)
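If you don't want to dig through those wrappers, the same pattern boils down to something like this (a self-contained sketch; the conversation and sampling values here are placeholders, not the ones from my generation config):

import torch

conversation = [{"role": "user", "content": "What is the capital of France?"}]

# return_dict=True gives both input_ids and attention_mask as PyTorch tensors
encoded = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=256,
        do_sample=True,
        top_k=50,
        top_p=0.9,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id)

# strip the prompt tokens before decoding so only the new reply is printed
reply = output[0][encoded["input_ids"].shape[-1]:]
print(tokenizer.decode(reply, skip_special_tokens=True))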