Tokenizer inconsistencies related to HTML tags

#11
by sanderland - opened

Similar to https://huggingface.co/google/gemma-7b/discussions/76

from transformers import AutoTokenizer

model = '01-ai/Yi-9B'
slowtok = AutoTokenizer.from_pretrained(model, use_fast=False)
fasttok = AutoTokenizer.from_pretrained(model, use_fast=True)
phrase = 'this is html <h5>'
st = slowtok.encode(phrase)  # uses the single <h5> token
ft = fasttok.encode(phrase)  # uses four tokens: <, h, 5, >
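
For illustration, printing the actual token strings makes the mismatch easy to see (a quick sketch; the exact pieces depend on the Yi vocabulary):

print(slowtok.convert_ids_to_tokens(st))  # expected to contain a single '<h5>' piece
print(fasttok.convert_ids_to_tokens(ft))  # expected to split the tag into '<', 'h', '5', '>'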
01-ai org

Our tokenizer does not support fast, just like the Llama tokenizer.

@YShow Maybe it could be disabled, e.g. the opposite of https://huggingface.co/stabilityai/stablelm-2-12b/discussions/1?

01-ai org

Thank you for your suggestion. We will pay attention to this in the next version.

The Llama tokenizer does support fast; this tokenizer in particular is quite unusual.
With Llama (and legacy=False), both produce the same output.
The main issue is that the fast conversion should be revisited: for Yi, the user_defined_symbols include a lot of tokens:

print(proto.trainer_spec.user_defined_symbols)
['<fim_prefix>', '<fim_middle>', '<fim_suffix>', '<fim_pad>', '<filename>', '<gh_stars>', '<issue_start>', '<issue_comment>', '<issue_closed>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>', '<jupyter_output>', '<empty_output>', '<commit_before>', '<commit_msg>', '<commit_after>', '<reponame>', '<h1>', '<h1/>', '</h1>', '<h2>', '<h2/>', '</h2>', '<h3>', '<h3/>', '</h3>', '<h4>', '<h4/>', '</h4>', '<h5>', '<h5/>', '</h5>', '<br>', '<br/>', '</br>', '<strong>', '<strong/>', '</strong>', '<p>', '<p/>', '</p>', '<table>', '<table/>', '</table>', '<li>', '<li/>', '</li>', '<tr>', '<tr/>', '</tr>', '<tbody>', '<tbody/>', '</tbody>', '<img>', '<img/>', '</img>', '<b>', '<b/>', '</b>', '<td>', '<td/>', '</td>', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ',', '.', '!', '?', ',', '。', '!', '?', '、', ':', '¥', '《', '》', '【', '】', '『', '』', '```', '<!--', '-->', '---', '<!DOCTYPE>', '\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t', '\t\t\t\t\t\t', '\t\t\t\t\t', '\t\t\t\t', '\t\t\t', '\t\t', '\t', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁', '▁▁▁', '▁▁', '\x08', '\r', '<|unused999|>', '<|unused000|>', '<|unused001|>', '<|unused002|>', '<|unused003|>', '<|unused004|>', '<|unused005|>', '<|unused006|>', '<|unused007|>', '<|unused008|>', '<|unused009|>', '<|unused010|>', '<|unused011|>', '<|unused012|>', '<|unused013|>', '<|unused014|>', '<|unused015|>', '<|unused016|>', '<|unused017|>', '<|unused018|>', '<|unused019|>', '<|unused020|>', '<|unused021|>', '<|unused022|>', '<|unused023|>', '<|unused024|>', '<|unused025|>', '<|unused026|>', '<|unused027|>', '<|unused028|>', '<|unused029|>', '<|unused030|>', '<|unused031|>', '<|unused032|>', '<|unused033|>', '<|unused034|>', '<|unused035|>', '<|unused036|>', '<|unused037|>', '<|unused038|>', '<|unused039|>', '<|unused040|>', '<|unused041|>', '<|unused042|>', '<|unused043|>', '<|unused044|>', '<|unused045|>', '<|unused046|>', '<|unused047|>', '<|unused048|>', '<|unused049|>', '<|unused050|>', '<|unused051|>', '<|unused052|>', '<|unused053|>', '<|unused054|>', '<|unused055|>', '<|unused056|>', '<|unused057|>', '<|unused058|>', '<|unused059|>', '<|unused060|>', '<|unused061|>', '<|unused062|>', '<|unused063|>', '<|unused064|>', '<|unused065|>', '<|unused066|>', '<|unused067|>', '<|unused068|>', '<|unused069|>', '<|unused070|>', '<|unused071|>', '<|unused072|>', '<|unused073|>', '<|unused074|>', '<|unused075|>', '<|unused076|>', '<|unused077|>', '<|unused078|>', '<|unused079|>', '<|unused080|>', '<|unused081|>', '<|unused082|>', '<|unused083|>', '<|unused084|>', '<|unused085|>', '<|unused086|>', '<|unused087|>', '<|unused088|>', '<|unused089|>', '<|unused090|>', '<|unused091|>', '<|unused092|>', '<|unused093|>', '<|unused094|>', '<|unused095|>', '<|unused096|>', '<|unused097|>', '<|unused098|>', '<|unused099|>', '<|unused100|>', '<|unused101|>', '<|unused102|>', '<|unused103|>', '<|unused104|>', '<|unused105|>', '<|unused106|>', '<|unused107|>', '<|unused108|>', '<|unused109|>', '<|unused110|>', '<|unused111|>', '<|unused112|>', '<|unused113|>', '<|unused114|>', '<|unused115|>', '<|unused116|>', '<|unused117|>', '<|unused118|>', '<|unused119|>', '<|unused120|>', '<|unused121|>', '<|unused122|>', '<|unused123|>', '<|unused124|>', '<|unused125|>']
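
For reference, the proto above can be inspected roughly like this (a sketch; it assumes the Yi sentencepiece model has been downloaded locally as tokenizer.model):

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the raw sentencepiece model proto and look at its trainer spec.
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())
print(proto.trainer_spec.user_defined_symbols)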

But these are not handled by the conversion; I can of course update that to make sure they are taken into account.
What you can already do is add all of these tokens to the slow tokenizer, which will add them to the fast one as well.
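
A minimal sketch of that workaround, assuming these should behave as regular (non-special) added tokens that are neither split nor normalized:

from transformers import AutoTokenizer, AddedToken

slowtok = AutoTokenizer.from_pretrained('01-ai/Yi-9B', use_fast=False)
# A few of the user_defined_symbols above; in practice you would add the full list.
html_tokens = ['<h1>', '</h1>', '<h5>', '</h5>', '<br>', '<p>']
slowtok.add_tokens([AddedToken(t, normalized=False, special=False) for t in html_tokens])
slowtok.save_pretrained('yi-9b-patched-tokenizer')
# Reloading the saved tokenizer with use_fast=True should then carry the added tokens over.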

The second issue is that the sentencepiece tokenizer does not add a prefix space, while the converted fast tokenizer does. You can disable that in Llama using add_dummy_prefix_space=True, or:

fasttok._tokenizer.pre_tokenizer.prepend_scheme = "never"

This can only be done if you are using legacy=False, which I highly recommend if you add tokens.
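
Putting both points together, a rough end-to-end sketch of the current workaround (the legacy/from_slow combination here is an assumption based on the comments above, not an official fix from 01-ai):

from transformers import AutoTokenizer

model = '01-ai/Yi-9B'
# Rebuild the fast tokenizer from the slow one with the non-legacy behaviour.
fasttok = AutoTokenizer.from_pretrained(model, use_fast=True, legacy=False, from_slow=True)
# Stop the Metaspace pre-tokenizer from prepending a dummy prefix space,
# so it matches the slow sentencepiece tokenizer.
fasttok._tokenizer.pre_tokenizer.prepend_scheme = "never"

print(fasttok.encode('this is html <h5>'))  # the tag stays split until the HTML tokens are added as above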
