Errors in Tokenizer „convert_id_to_token“

by michaelfeil - opened Jul 11, 2023

Discussion

michaelfeil

Jul 11, 2023

•

edited Jul 11, 2023

Dear Authors,

We are integrating CodeGen2.5 into CTranslate2, an open-source inference engine.
I previously wrote the conversion code for CodeGen1 and CodeGen2.

CTranslate2 encodes the vocabulary to ids. There seems to be a problem with some tokens and their byte/UTF-8 encoding.
See:

https://github.com/OpenNMT/CTranslate2/issues/1334

Do you have any guidance or intuition on how this bug might be resolved?

anthonygiuliano

Nov 13, 2023

Not sure if this will help, but I was running into something similar. I found that tokens 94-187 aren't valid UTF-8 characters, and after experimenting a bit it seemed like they needed to be decoded using 'latin-1' encoding instead.
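
A minimal sketch of that fallback, assuming each token's raw bytes are available as a `bytes` object. The helper name `token_bytes_to_str` and the sample byte strings are hypothetical illustrations, not part of the CodeGen2.5 tokenizer or CTranslate2:

```python
def token_bytes_to_str(raw: bytes) -> str:
    """Decode a token's raw bytes, falling back to latin-1 if UTF-8 fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to exactly one code point,
        # so this fallback cannot fail and is reversible.
        return raw.decode("latin-1")


# Hypothetical sample values: two valid UTF-8 sequences and two lone
# high bytes that are not valid UTF-8 on their own.
samples = [b"hello", b"\xc3\xa9", b"\xa1", b"\xff"]
decoded = [token_bytes_to_str(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(raw, "->", repr(text))
```

Because latin-1 is a one-to-one byte mapping, a string produced by the fallback can be turned back into the original bytes with `text.encode("latin-1")`. Whether this matches what the tokenizer intends for ids 94-187 is an assumption that would need checking against the vocabulary itself.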
