Encoding of vocabulary

#1
by SDFASDGA - opened

Can you please tell me what the encoding of the vocabulary text is? For example, the README example shows the tokenizer output ['ìķĦë§Ī', 'ì¡´', 'ĠìĦ¸', 'ìĿ´ì§Ģ', 'ë©ĶìĿ´', '커'] when tokenizing "아마존 세이지메이커". How do I map these tokens back to the Korean Unicode characters?

The tokenizer used in phi-2 is not a plain character-level BPE but a byte-level BPE like GPT-2's, so the tokens you see are byte-level symbols and a Unicode conversion is required to recover the normal characters.
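
If you want to perform the byte-to-character mapping by hand, here is a minimal sketch. It assumes the slow (Python) GPT2Tokenizer from transformers, which exposes a byte_decoder table that inverts GPT-2's byte-to-Unicode mapping (fast tokenizers do not expose this attribute); any GPT-2-style byte-level BPE shares the same table.

from transformers import GPT2Tokenizer

# The plain "gpt2" tokenizer is enough to illustrate the mapping,
# since all GPT-2-style byte-level BPEs share the byte-to-Unicode table.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

token = "ìķĦë§Ī"  # first byte-level token of "아마존 세이지메이커"

# Each character of the token stands for one raw byte; byte_decoder
# maps it back, and the resulting bytes form a UTF-8 string.
raw_bytes = bytes(tokenizer.byte_decoder[ch] for ch in token)
print(raw_bytes.decode("utf-8"))  # -> 아마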

See the Colab notebook below for more details.
https://colab.research.google.com/drive/15tMASZ0NLm8bnxkM4uXCRgdzznSpbp9L?usp=sharing#scrollTo=UwJT5ZSfwOGc

See the code snippet below to restore the original string.

from transformers import AutoTokenizer

# Load the tokenizer; substitute the checkpoint you are actually using.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

input_str = "아마존 세이지메이커"
tokenized = tokenizer.tokenize(input_str)         # byte-level tokens
encoded = tokenizer.encode_plus(input_str)        # token ids + attention mask
decoded = tokenizer.decode(encoded["input_ids"])  # restores the original string
print(encoded)
print(tokenized)  # byte-level symbols, not readable characters!
print(decoded)
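
To see what each individual token stands for, here is a short follow-up sketch using convert_tokens_to_string, the standard transformers helper; note that a token which splits a multi-byte UTF-8 sequence will decode to a replacement character.

for token in tokenized:
    print(token, "->", tokenizer.convert_tokens_to_string([token]))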

daekeun-ml changed discussion status to closed
