Tokenizer issues

#4
by sanderland - opened

from transformers import AutoTokenizer

tid = 31750  # a token whose decode/encode round-trip fails
for use_fast in [False, True]:
    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-base", use_fast=use_fast)
    decoded = tok.decode([tid])
    # re-encode the decoded string and inspect each resulting token
    reencoded = [(i, tok.decode([i])) for i in tok.encode(decoded, add_special_tokens=False)]
    print(f"{use_fast=}\t{tid=} {decoded=} {reencoded=}")

Gives

use_fast=False	tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
use_fast=True	tid=31750 decoded=' indústria' reencoded=[(1539, ' ind'), (32007, '�'), (292, 'st'), (2122, 'ria')]
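
Joining the per-token decodes makes the round-trip loss explicit (a one-liner run in the same session as the snippet above):

print("".join(t for _, t in reencoded))  # ' ind�stria': the 'ú' is lost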

It seems the tokens at the end of the vocabulary were an attempt to manually add the missing (but unused) UTF-8 byte tokens, but they were instead mapped to the corresponding unicode code points, and now override other tokens during encoding.
The non-code models are similarly affected.
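
If that reading is right (assuming token 32007 was meant to be the byte 0xFA, whose unicode code point is 'ú'), the mismatch is reproducible in plain Python, no tokenizer needed:

print(chr(0xFA))  # 'ú': a byte token mis-added as chr(0xFA) captures 'ú' when encoding
print(bytes([0xFA]).decode("utf-8", errors="replace"))  # '�': a lone 0xFA byte is invalid UTF-8, hence U+FFFD when decoding
print(list("ú".encode("utf-8")))  # [195, 186]: in valid UTF-8, 'ú' is the two-byte sequence 0xC3 0xBA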

Apparently fixed in the llm/math models, but a won't-fix for the code models; cf. https://github.com/huggingface/tokenizers/issues/1392
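
For anyone wanting to check a specific checkpoint, a small sketch (roundtrip_ok is a hypothetical helper, and the token id is only meaningful for deepseek-family vocabularies):

from transformers import AutoTokenizer

def roundtrip_ok(model_name, tid=31750):
    # True if decode -> encode -> decode reproduces the original string
    tok = AutoTokenizer.from_pretrained(model_name)
    decoded = tok.decode([tid])
    ids = tok.encode(decoded, add_special_tokens=False)
    return tok.decode(ids) == decoded

# e.g. roundtrip_ok("deepseek-ai/deepseek-coder-33b-base") returns False per the output above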

sanderland changed discussion status to closed
