Should the bloomz-7b1 tokenizer_config.json have padding_side="left"?
The other bloom tokenizers seem to set that, and running generation with 7b1 shows this message:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
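A minimal sketch of what the fix would look like, assuming the rest of the repository's `tokenizer_config.json` stays unchanged (other fields omitted here):

```json
{
  "padding_side": "left"
}
```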
Hi @bmot,
This is correct, cc-ing @Muennighoff to be sure, but one should open a PR to fix that. Do you want to open it? Otherwise I can do it.
Yeah good point, would be great if you could do it @ybelkada 🤗
Closing as https://huggingface.co/bigscience/bloomz-7b1/discussions/7 has been merged.
FYI, here is why we use padding_side="left" (from @patrickvonplaten):
In short, when you generate, you always take the last token to predict the next one. However, if you generate in batches, the last token (when padding is "right") might be a padding token, which would then incorrectly be conditioned on in our generate method.
Imagine the batch:
["hello my name is", "hey <pad> <pad> <pad>"]
For the first input, the next token will correctly be sampled after "is"; however, for the second input, generate would incorrectly sample after "<pad>", whereas it should sample after "hey".
Making sure everything is padded on the left circumvents this problem.
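To make that concrete, here is a small plain-Python sketch (no transformers, illustrative token strings only) showing which token a decoder-only generate loop would condition on under each padding side:

```python
# Illustrative sketch: why right padding breaks batched generation
# in a decoder-only model. The <pad> marker and token strings are
# made up for the example, not the real BLOOM vocabulary.

PAD = "<pad>"

def pad_batch(batch, side):
    """Pad every sequence in the batch to the same length with <pad>."""
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pads = [PAD] * (width - len(seq))
        padded.append(pads + seq if side == "left" else seq + pads)
    return padded

batch = [["hello", "my", "name", "is"], ["hey"]]

# generate() conditions on the final position of each row.
right = pad_batch(batch, "right")
print([seq[-1] for seq in right])  # ['is', '<pad>'] -- second row is wrong

left = pad_batch(batch, "left")
print([seq[-1] for seq in left])   # ['is', 'hey'] -- both rows correct
```

With left padding, the last real token of every sequence sits at the final position, so the model samples the next token from the right place for all rows in the batch.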