Should the bloomz-7b1 tokenizer_config.json have padding_side="left"?

#6 · opened by bmot

The other bloom tokenizers seem to set that, and running generation with 7b1 shows this message:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
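The fix the warning points at corresponds to a one-line addition in the repository's tokenizer_config.json (a sketch; the actual file's other fields are elided here):

```json
{
  "padding_side": "left"
}
```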
BigScience Workshop org

Hi @bmot ,
This is correct (cc @Muennighoff to confirm), but someone should open a PR to fix this. Would you like to open it? Otherwise I can do it.

BigScience Workshop org

Yeah good point, would be great if you could do it @ybelkada 🤗

BigScience Workshop org

FYI, why we use `padding_side="left"` (from @patrickvonplaten):

In short, when you generate you always take the last token to predict the next one. However, if you generate in batches, the last token (if padding is on the right) might be a padding token, which would then incorrectly be used by our generate method.

Imagine the batch: ["hello my name is", "hey <pad> <pad> <pad>"]. For the first input, the next token is correctly sampled after "is"; for the second input, generate would incorrectly sample after the final "<pad>", whereas it should sample after "hey".
Padding everything on the left circumvents this problem.
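The failure mode described above can be sketched in plain Python, with no transformers dependency. The `last_real_token` helper below is a hypothetical stand-in for the position that `generate` conditions on, not the actual library implementation:

```python
# Why right-padding breaks batched generation for decoder-only models:
# the model always conditions on the LAST position of each sequence.

PAD = "<pad>"

def last_real_token(sequence):
    """Return the token a naive generate step conditions on:
    simply the last position of the (padded) sequence."""
    return sequence[-1]

# Batch of two prompts, padded to equal length on the RIGHT.
right_padded = [
    ["hello", "my", "name", "is"],
    ["hey", PAD, PAD, PAD],
]
# The second row conditions on a pad token -- wrong.
print([last_real_token(seq) for seq in right_padded])  # ['is', '<pad>']

# Same batch padded on the LEFT.
left_padded = [
    ["hello", "my", "name", "is"],
    [PAD, PAD, PAD, "hey"],
]
# Now the last position of every row is a real token.
print([last_real_token(seq) for seq in left_padded])  # ['is', 'hey']
```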

ybelkada changed discussion status to closed
