Encoding of vocabulary

#1
by SDFASDGA - opened

Can you please tell me what the encoding of the vocabulary text is? For example, the README example shows the tokenizer output ['ìķĦë§Ī', 'ì¡´', 'ĠìĦ¸', 'ìĿ´ì§Ģ', 'ë©ĶìĿ´', '커'] when tokenizing "아마존 세이지메이커". How do I map these tokens back to the Korean Unicode characters?

The tokenizer used in phi-2 is not a plain character-level BPE but a byte-level BPE like GPT-2's, so the tokens you see are byte-level symbols and a Unicode conversion is required to recover the normal characters.
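
If you want to perform the byte-to-character mapping by hand, here is a minimal sketch. It assumes the slow (Python) GPT2Tokenizer from transformers, which exposes a byte_decoder table that inverts GPT-2's byte-to-Unicode mapping (fast tokenizers do not expose this attribute); any GPT-2-style byte-level BPE shares the same table.

from transformers import GPT2Tokenizer

# The plain "gpt2" tokenizer is enough to illustrate the mapping,
# since all GPT-2-style byte-level BPEs share the byte-to-Unicode table.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

token = "ìķĦë§Ī"  # first byte-level token of "아마존 세이지메이커"

# Each character of the token stands for one raw byte; byte_decoder
# maps it back, and the resulting bytes form a UTF-8 string.
raw_bytes = bytes(tokenizer.byte_decoder[ch] for ch in token)
print(raw_bytes.decode("utf-8"))  # -> 아마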

See the Colab notebook below for more details.
https://colab.research.google.com/drive/15tMASZ0NLm8bnxkM4uXCRgdzznSpbp9L?usp=sharing#scrollTo=UwJT5ZSfwOGc

See the code snippet below to restore the original string.

from transformers import AutoTokenizer

# Load the tokenizer; substitute the checkpoint you are actually using.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

input_str = "아마존 세이지메이커"
tokenized = tokenizer.tokenize(input_str)         # byte-level tokens
encoded = tokenizer.encode_plus(input_str)        # token ids + attention mask
decoded = tokenizer.decode(encoded["input_ids"])  # restores the original string
print(encoded)
print(tokenized)  # byte-level symbols, not readable characters!
print(decoded)
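
To see what each individual token stands for, here is a short follow-up sketch using convert_tokens_to_string, the standard transformers helper; note that a token which splits a multi-byte UTF-8 sequence will decode to a replacement character.

for token in tokenized:
    print(token, "->", tokenizer.convert_tokens_to_string([token]))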

daekeun-ml changed discussion status to closed
