Fix slow tokenizer

#14
by pcuenq - opened
No description provided.
Google org

Before:

>>> from transformers import GemmaTokenizer
>>> tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")
>>> tokenizer.decode([139])
'▁▁'

With this PR:

>>> tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")
>>> tokenizer.decode([139])
'  '
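
A hedged round-trip check built on the example above (the multi-space input string is an illustrative assumption, not taken from the PR):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")

text = "two  spaces"  # illustrative input containing consecutive spaces
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))  # with this PR, the consecutive spaces should be preserved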

This was introduced when the user-defined tokens were added, in a PR similar to this one: https://huggingface.co/google/gemma-1.1-2b-it/discussions/1.

I don't know whether we should keep them in the config using the character " " instead of "▁". I tested the slow and fast tokenizers for encoding and decoding, but I may be missing something.
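
A minimal sketch of the kind of slow/fast consistency check described above (the test strings are illustrative assumptions):

from transformers import AutoTokenizer

repo = "google/gemma-1.1-7b-it"
slow = AutoTokenizer.from_pretrained(repo, revision="refs/pr/14", use_fast=False)
fast = AutoTokenizer.from_pretrained(repo, revision="refs/pr/14", use_fast=True)

for text in ["hello world", "two  spaces", "   three leading spaces"]:
    slow_ids = slow.encode(text, add_special_tokens=False)
    fast_ids = fast.encode(text, add_special_tokens=False)
    # The token ids and the decoded strings should agree between the two implementations.
    print(slow_ids == fast_ids, slow.decode(slow_ids) == fast.decode(fast_ids))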

Google org
>>> tokenizer.convert_ids_to_tokens([139])
['▁▁']

which means the token itself is correct, but the decode output is not. tokenizers does not complain, but transformers never post-processes the replacements (converting "▁" back to spaces).
These should be the only affected tokens, because they are the ones defined with the replacement character.

This could also be fixed in transformers.
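
For reference, a rough sketch of what such a post-processing step could look like, assuming the only problem is that decoded text keeps the SentencePiece whitespace marker "▁" (U+2581) for these added tokens. The helper name is hypothetical and this is not the actual transformers implementation:

SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece whitespace marker

def postprocess_decoded_text(text: str) -> str:
    # Hypothetical helper: map the whitespace marker back to plain spaces,
    # as already happens for regular (non-added) tokens during decoding.
    return text.replace(SPIECE_UNDERLINE, " ")

print(postprocess_decoded_text("▁▁"))  # prints two plain spaces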

pcuenq changed pull request status to merged
