padding_side discrepancy

#84
by Muennighoff - opened
BigScience Workshop org
PreTrainedTokenizerFast(name_or_path='bigscience/tokenizer', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

PreTrainedTokenizerFast(name_or_path='bigscience/bloom', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

I think the two should be the same, no? cc @ybelkada
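
For reference, the discrepancy can be reproduced with something like the following (a sketch, assuming both repos are loaded via AutoTokenizer):

from transformers import AutoTokenizer

# Load both tokenizer repos and compare their default padding sides
tok_a = AutoTokenizer.from_pretrained("bigscience/tokenizer")
tok_b = AutoTokenizer.from_pretrained("bigscience/bloom")

print(tok_a.padding_side)  # 'right'
print(tok_b.padding_side)  # 'left'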

BigScience Workshop org

Yeah, you are right, the padding_side of bigscience/tokenizer should be set to left. I opened a PR at: https://huggingface.co/bigscience/tokenizer/discussions/3
Feel free to merge it ;)
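
Until that PR is merged, the setting can also be overridden locally; a minimal sketch (the padding_side kwarg is forwarded to the tokenizer's constructor):

from transformers import AutoTokenizer

# Force left padding regardless of what the repo's tokenizer_config.json says
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer", padding_side="left")

# Equivalently, the attribute can be reassigned after loading
tokenizer.padding_side = "left"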

TimeRobber changed discussion status to closed

padding_side='left'?? If it's a left-to-right LM, the padding should be on the right. Here's the current behavior:

In [72]: tokenizer(["foo","foo bar baz"],return_tensors="pt", padding="longest")['input_ids']
Out[72]: 
tensor([[    3,     3, 27988],
        [27988,  2879, 20300]])
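
In the output above, id 3 is the <pad> token, so the shorter sequence is padded on the left. For what it's worth, left padding is what decoder-only models like BLOOM typically want for batched generate() calls, since it keeps the last prompt token adjacent to the newly generated tokens. A sketch comparing the two settings (assuming padding_side can be reassigned on the loaded tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# Default from the repo: pads on the left
tokenizer.padding_side = "left"
print(tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")["input_ids"])
# tensor([[    3,     3, 27988],
#         [27988,  2879, 20300]])

# Right padding instead puts the pad ids after the real tokens
tokenizer.padding_side = "right"
print(tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")["input_ids"])
# tensor([[27988,     3,     3],
#         [27988,  2879, 20300]])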
