Fix slow tokenizer

#14
by pcuenq - opened
No description provided.
Google org

Before:

>>> from transformers import GemmaTokenizer
>>> tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")
>>> tokenizer.decode([139])
'▁▁'

With this PR:

>>> tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")
>>> tokenizer.decode([139])
'  '
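
A hedged round-trip check built on the example above (the multi-space input string is an illustrative assumption, not taken from the PR):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")

text = "two  spaces"  # illustrative input containing consecutive spaces
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))  # with this PR, the consecutive spaces should be preserved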

This was introduced when the user-defined tokens were added, in a PR similar to this one: https://huggingface.co/google/gemma-1.1-2b-it/discussions/1.

I don't know whether we should keep them in the config using the character " " instead of "▁". I tested the slow and fast tokenizers for encoding and decoding, but I may be missing something.
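
A minimal sketch of the kind of slow/fast consistency check described above (the test strings are illustrative assumptions):

from transformers import AutoTokenizer

repo = "google/gemma-1.1-7b-it"
slow = AutoTokenizer.from_pretrained(repo, revision="refs/pr/14", use_fast=False)
fast = AutoTokenizer.from_pretrained(repo, revision="refs/pr/14", use_fast=True)

for text in ["hello world", "two  spaces", "   three leading spaces"]:
    slow_ids = slow.encode(text, add_special_tokens=False)
    fast_ids = fast.encode(text, add_special_tokens=False)
    # The token ids and the decoded strings should agree between the two implementations.
    print(slow_ids == fast_ids, slow.decode(slow_ids) == fast.decode(fast_ids))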

Google org
>>> tokenizer.convert_ids_to_tokens([139])
['▁▁']

which means the token itself is correct, but the decode output is not. tokenizers does not complain, but transformers never post-processes the replacements (converting "▁" back to spaces).
These should be the only affected tokens, because they are the ones defined with the replacement character.

This could also be fixed in transformers.
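
For reference, a rough sketch of what such a post-processing step could look like, assuming the only problem is that decoded text keeps the SentencePiece whitespace marker "▁" (U+2581) for these added tokens. The helper name is hypothetical and this is not the actual transformers implementation:

SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece whitespace marker

def postprocess_decoded_text(text: str) -> str:
    # Hypothetical helper: map the whitespace marker back to plain spaces,
    # as already happens for regular (non-added) tokens during decoding.
    return text.replace(SPIECE_UNDERLINE, " ")

print(postprocess_decoded_text("▁▁"))  # prints two plain spaces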

pcuenq changed pull request status to merged
