Errors in Tokenizer "convert_id_to_token"

#3
by michaelfeil - opened

Dear Authors,

We are integrating CodeGen2.5 into CTranslate2, an open-source inference engine.
I previously wrote the code to do so for CodeGen1 and 2.

CTranslate2 encodes the vocabulary to ids. There seems to be a problem with some tokens and their byte/UTF-8 encoding.
See:

https://github.com/OpenNMT/CTranslate2/issues/1334

Do you have any guidance or intuition on how this bug might be resolved?

Not sure if this will help, but I was running into something similar. I found that tokens 94-187 aren't utf-8 characters, and after experimenting a bit it seemed like they wanted to be decoded using 'latin-1' encoding.
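Roughly what that fallback could look like, as a minimal sketch only. It assumes you can get at the raw bytes of each token; `decode_token_bytes` and the sample byte strings are hypothetical and are not taken from the CodeGen2.5 tokenizer or from CTranslate2.

```python
def decode_token_bytes(raw: bytes) -> str:
    """Decode a token's raw bytes, falling back to latin-1 when UTF-8 fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to exactly one code point,
        # so this fallback cannot raise.
        return raw.decode("latin-1")


# Hypothetical sample byte strings, not taken from the real vocabulary.
samples = [b"def", "é".encode("utf-8"), b"\xe9", b"\xa0"]
decoded = [decode_token_bytes(s) for s in samples]
print(decoded)
```

A lone byte such as `0xE9` is not valid UTF-8 on its own, which is why the first decode attempt fails for it while `latin-1` succeeds. Whether this matches what the tokenizer intends for those ids is a guess based on my experiments.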
