What is the id 9 in japanese-gpt2-medium tokenizer?

#1
by gojiteji - opened

I'm trying to use japanese-gpt2-medium for my research.
I found that the tokenizer sometimes outputs id 9 at the start of the sequence, as shown below.

print(tokenizer("hello world").input_ids)
> [9, 22848, 463, 7375, 2]
print(tokenizer("dog").input_ids)
> [6832, 275, 2]

But id 9 appears to decode to an empty string.

print(tokenizer.decode([9]), len(tokenizer.decode([9])))
> 0

What does the token with id 9 mean? When fine-tuning, should I keep id 9 in the input?

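It is the meta symbol "▁" (U+2581) that sentencepiece uses to mark whitespace. It is also added at the start of the input, so it shows up as its own token (id 9) when it is not merged into the first piece, and it is absorbed into the first piece otherwise (as in '▁do' below).
Please refer to the sentencepiece repo for details: https://github.com/google/sentencepiece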

>>> tokenizer.tokenize("hello world")
['▁', 'hell', 'o', '▁world']
>>> tokenizer.tokenize("dog")
['▁do', 'g']
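
To illustrate why a lone "▁" decodes to an empty string, here is a rough sketch of how the whitespace marker behaves. It is a simplified model written for this thread, not sentencepiece's actual implementation; the helper names `to_marked_text` and `decode_pieces` are made up for the example, and it assumes the marker is prepended to the input as the tokenizations above suggest.

```python
from typing import List

META = "\u2581"  # the whitespace meta symbol "▁"


def to_marked_text(text: str) -> str:
    """Simplified pre-processing: prepend the marker and replace spaces with it."""
    return META + text.replace(" ", META)


def decode_pieces(pieces: List[str]) -> str:
    """Simplified detokenization: join pieces, map markers back to spaces,
    then drop the single leading space that came from the prepended marker."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


# The marked text is what gets split into pieces such as ['▁', 'hell', 'o', '▁world'].
print(to_marked_text("hello world"))

# Both segmentations round-trip to the original text.
print(repr(decode_pieces([META, "hell", "o", META + "world"])))
print(repr(decode_pieces([META + "do", "g"])))

# A lone marker carries no visible text, which matches the length of 0 seen above.
print(repr(decode_pieces([META])))
```

In this model the leading marker carries no characters of its own, which is consistent with id 9 decoding to nothing while still being a normal part of the token sequence.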

You can leave it as is for fine-tuning.

Thank you for your reply.
I see.
I’ll do so.

gojiteji changed discussion status to closed
