padding_side discrepancy
#84
by Muennighoff · opened
PreTrainedTokenizerFast(name_or_path='bigscience/tokenizer', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
PreTrainedTokenizerFast(name_or_path='bigscience/bloom', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
I think the two should be the same, no? cc @ybelkada
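For reference, a quick way to see the mismatch side by side (a minimal sketch, assuming a recent transformers with AutoTokenizer):

```python
from transformers import AutoTokenizer

# Load both repos and compare the configured padding side.
hub_tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")
bloom_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

print(hub_tokenizer.padding_side)    # 'right' at the time of this thread
print(bloom_tokenizer.padding_side)  # 'left'
```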
Yeah, you are right, the padding_side of bigscience/tokenizer should be set to left. I opened a PR at: https://huggingface.co/bigscience/tokenizer/discussions/3
Feel free to merge it ;)
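Until the PR is merged, the setting can also be overridden locally; a minimal sketch (padding_side is a standard tokenizer kwarg and attribute, not specific to this repo):

```python
from transformers import AutoTokenizer

# Either pass the kwarg at load time ...
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer", padding_side="left")

# ... or set the attribute afterwards.
tokenizer.padding_side = "left"
```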
TimeRobber changed discussion status to closed
padding_side='left'?? If it's a left-to-right LM, the padding should be on the right. Here's the current behavior:
In [72]: tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")['input_ids']
Out[72]:
tensor([[    3,     3, 27988],
        [27988,  2879, 20300]])
If left padding is correct, it is also needed in https://huggingface.co/bigscience/bloomz-7b1-mt/blob/main/tokenizer_config.json
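For what it's worth, left padding is the usual choice when batching prompts for generation with a decoder-only model: generate() continues from the last position of each row, so with right padding the model would be asked to continue from trailing <pad> tokens instead of the end of the prompt. A rough sketch, using the smaller bigscience/bloom-560m checkpoint purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # smaller checkpoint, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

batch = tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")
# With left padding, the last column of input_ids holds the final real token of each
# prompt, which is where generate() starts appending new tokens. With right padding
# the continuation would follow the <pad> tokens instead.
out = model.generate(**batch, max_new_tokens=5)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```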