Tokenizer issues
#4
by sanderland - opened
from transformers import AutoTokenizer

tid = 31750
for use_fast in [False, True]:
    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-base", use_fast=use_fast)
    # Decode a single token id, then re-encode the resulting string
    # to check whether it round-trips back to the same id.
    decoded = tok.decode([tid])
    reencoded = [(i, tok.decode([i])) for i in tok.encode(decoded, add_special_tokens=False)]
    print(f"{use_fast=}\t{tid=} {decoded=} {reencoded=}")
Gives
use_fast=False tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
use_fast=True tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
It seems the tokens at the end of the vocabulary were an attempt to manually add the missing (but unused) UTF-8 byte tokens, but they were instead mapped to the corresponding Unicode code points, and they now override other tokens when encoding.
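To make the byte-vs-code-point confusion concrete, here is a minimal standalone sketch (standard library only, not the model's actual vocab-building code):

# 'ú' is one Unicode code point (U+00FA) but two UTF-8 bytes, so a byte-level
# vocab entry must represent the bytes, not the code point.
ch = "ú"
print(hex(ord(ch)))        # 0xfa -- the Unicode code point
print(ch.encode("utf-8"))  # b'\xc3\xba' -- the UTF-8 bytes a byte-level BPE sees
# If a vocab slot intended for the raw byte 0xFA is instead registered as the
# one-character string chr(0xFA) ('ú'), it shadows the existing token for that
# character, and decoded text no longer re-encodes to the original ids.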
The non-code models are similarly affected.
This is apparently fixed in the LLM/math models, but is a won't-fix in the code models; cf. https://github.com/huggingface/tokenizers/issues/1392
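For anyone who wants to inspect the colliding vocab slots directly, a quick sketch using standard transformers tokenizer methods (the exact strings printed are model-dependent):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-base")
# 31750 is the ' indústria' merge seen above; 32007 appears to be one of the
# manually-added entries near the end of the vocab that shadows part of it.
for i in (31750, 32007):
    print(i, repr(tok.convert_ids_to_tokens([i])[0]), repr(tok.decode([i])))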
sanderland changed discussion status to closed