Better Tokenizer

#5
by Ehsanjahanbakhsh - opened

The code used to create the tokenizer adds 100 extra tokens (extra_ids) by default. On top of that, T5Tokenizer's default padding token is "<pad>", which does not exist in the sentencepiece model, so yet another token gets added; this makes batch inference impossible.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)
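
One quick way to see the mismatch is to compare the raw sentencepiece vocabulary size with what the wrapped tokenizer reports; a rough sketch, assuming the same model path as above:

import sentencepiece as spm
from transformers import T5Tokenizer

sp = spm.SentencePieceProcessor(model_file='256k_vocab/spm.model')
tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)

print(sp.get_piece_size())  # size of the raw sentencepiece vocabulary
print(len(tokenizer))       # larger: the 100 extra_ids plus the added "<pad>" token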

Using the code below instead makes it work fine:

tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)
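
With extra_ids=0 and a padding token that already exists in the sentencepiece vocabulary, padded batch encoding works; a minimal sketch (the example sentences are just placeholders):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)

# Both sequences are padded to the same length using the '<s>' id,
# which the underlying sentencepiece model actually contains.
batch = tokenizer(["a short sentence", "a somewhat longer example sentence"], padding=True)
print(batch["input_ids"])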

Thank you for reporting this and for the fix in #6.

jbochi changed discussion status to closed
