padding_side discrepancy
#84
by Muennighoff · opened
PreTrainedTokenizerFast(name_or_path='bigscience/tokenizer', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
PreTrainedTokenizerFast(name_or_path='bigscience/bloom', vocab_size=250680, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
I think the two should be the same, no? cc @ybelkada
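For reference, a quick way to see the mismatch side by side (a minimal sketch, assuming a recent transformers with AutoTokenizer):

```python
from transformers import AutoTokenizer

# Load both repos and compare the configured padding side.
hub_tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")
bloom_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

print(hub_tokenizer.padding_side)    # 'right' at the time of this thread
print(bloom_tokenizer.padding_side)  # 'left'
```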
Yeah, you are right, the padding_side of bigscience/tokenizer should be set to left. I opened a PR at: https://huggingface.co/bigscience/tokenizer/discussions/3
Feel free to merge it ;)
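Until the PR is merged, the setting can also be overridden locally; a minimal sketch (padding_side is a standard tokenizer kwarg and attribute, not specific to this repo):

```python
from transformers import AutoTokenizer

# Either pass the kwarg at load time ...
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer", padding_side="left")

# ... or set the attribute afterwards.
tokenizer.padding_side = "left"
```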
TimeRobber changed discussion status to closed
padding_side='left'?? If it's a left-to-right LM, the padding should be on the right. Here's the current behavior:
In [72]: tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")['input_ids']
Out[72]:
tensor([[    3,     3, 27988],
        [27988,  2879, 20300]])
If left padding is correct, it is also needed in https://huggingface.co/bigscience/bloomz-7b1-mt/blob/main/tokenizer_config.json
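For what it's worth, left padding is the usual choice when batching prompts for generation with a decoder-only model: generate() continues from the last position of each row, so with right padding the model would be asked to continue from trailing <pad> tokens instead of the end of the prompt. A rough sketch, using the smaller bigscience/bloom-560m checkpoint purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # smaller checkpoint, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

batch = tokenizer(["foo", "foo bar baz"], return_tensors="pt", padding="longest")
# With left padding, the last column of input_ids holds the final real token of each
# prompt, which is where generate() starts appending new tokens. With right padding
# the continuation would follow the <pad> tokens instead.
out = model.generate(**batch, max_new_tokens=5)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```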