Should the bloomz-7b1 tokenizer_config.json have padding_side="left"?

#6 · opened by bmot

The other bloom tokenizers seem to set that, and running generation with 7b1 shows this message:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
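The fix the warning points at corresponds to a one-line addition in the repository's tokenizer_config.json (a sketch; the actual file's other fields are elided here):

```json
{
  "padding_side": "left"
}
```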
BigScience Workshop org

Hi @bmot ,
This is correct (cc @Muennighoff to confirm), but someone should open a PR to fix this. Would you like to open it? Otherwise I can do it.

BigScience Workshop org

Yeah good point, would be great if you could do it @ybelkada 🤗

BigScience Workshop org

FYI, why we use `padding_side="left"` (from @patrickvonplaten):

In short, when you generate you always take the last token to predict the next one. However, if you generate in batches, the last token (if padding is on the right) might be a padding token, which would then incorrectly be used by our generate method.

Imagine the batch: ["hello my name is", "hey <pad> <pad> <pad>"]. For the first input, the next token is correctly sampled after "is"; for the second input, generate would incorrectly sample after the final "<pad>", whereas it should sample after "hey".
Padding everything on the left circumvents this problem.
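The failure mode described above can be sketched in plain Python, with no transformers dependency. The `last_real_token` helper below is a hypothetical stand-in for the position that `generate` conditions on, not the actual library implementation:

```python
# Why right-padding breaks batched generation for decoder-only models:
# the model always conditions on the LAST position of each sequence.

PAD = "<pad>"

def last_real_token(sequence):
    """Return the token a naive generate step conditions on:
    simply the last position of the (padded) sequence."""
    return sequence[-1]

# Batch of two prompts, padded to equal length on the RIGHT.
right_padded = [
    ["hello", "my", "name", "is"],
    ["hey", PAD, PAD, PAD],
]
# The second row conditions on a pad token -- wrong.
print([last_real_token(seq) for seq in right_padded])  # ['is', '<pad>']

# Same batch padded on the LEFT.
left_padded = [
    ["hello", "my", "name", "is"],
    [PAD, PAD, PAD, "hey"],
]
# Now the last position of every row is a real token.
print([last_real_token(seq) for seq in left_padded])  # ['is', 'hey']
```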

ybelkada changed discussion status to closed
