Tokenizer issues

#4
by sanderland - opened

from transformers import AutoTokenizer

tid = 31750  # a token whose decode/encode round-trip fails
for use_fast in [False, True]:
    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-base", use_fast=use_fast)
    decoded = tok.decode([tid])
    # re-encode the decoded string and inspect each resulting token
    reencoded = [(i, tok.decode([i])) for i in tok.encode(decoded, add_special_tokens=False)]
    print(f"{use_fast=}\t{tid=} {decoded=} {reencoded=}")

Gives

use_fast=False	tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
use_fast=True	tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
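
Joining the per-token decodes makes the round-trip loss explicit (a one-liner run in the same session as the snippet above):

print("".join(t for _, t in reencoded))  # ' ind�stria': the 'ú' is lost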

It seems the tokens at the end of the vocabulary were an attempt to manually add the missing (but unused) UTF-8 byte tokens, but they were instead mapped to the corresponding unicode code points, and now override other tokens during encoding.
The non-code models are similarly affected.
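
If that reading is right (assuming token 32007 was meant to be the byte 0xFA, whose unicode code point is 'ú'), the mismatch is reproducible in plain Python, no tokenizer needed:

print(chr(0xFA))  # 'ú': a byte token mis-added as chr(0xFA) captures 'ú' when encoding
print(bytes([0xFA]).decode("utf-8", errors="replace"))  # '�': a lone 0xFA byte is invalid UTF-8, hence U+FFFD when decoding
print(list("ú".encode("utf-8")))  # [195, 186]: in valid UTF-8, 'ú' is the two-byte sequence 0xC3 0xBA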

Apparently fixed in the llm/math models, but a won't-fix for the code models; cf. https://github.com/huggingface/tokenizers/issues/1392
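
For anyone wanting to check a specific checkpoint, a small sketch (roundtrip_ok is a hypothetical helper, and the token id is only meaningful for deepseek-family vocabularies):

from transformers import AutoTokenizer

def roundtrip_ok(model_name, tid=31750):
    # True if decode -> encode -> decode reproduces the original string
    tok = AutoTokenizer.from_pretrained(model_name)
    decoded = tok.decode([tid])
    ids = tok.encode(decoded, add_special_tokens=False)
    return tok.decode(ids) == decoded

# e.g. roundtrip_ok("deepseek-ai/deepseek-coder-33b-base") returns False per the output above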

sanderland changed discussion status to closed
