The tokenizer problem

by ghosthamlet - opened

Thanks for open-sourcing this large GPT2 model.
I found that the tokenizer splits many single words into two or more tokens, so any tokenized text ends up double or more in length.
Why did you choose this tokenizer?
You can see the BPE tokenizer in the bloom model: it has a similar vocab size, but it tokenizes most single words into single tokens.
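The length blow-up is consistent with how GPT-2's byte-level BPE works: text is first converted to UTF-8 bytes and then merged, so a CJK character with no learned merge rule can become two or three tokens, roughly one per byte. A quick plain-Python check (no tokenizer library needed) shows the byte counts involved:

```python
# GPT-2's tokenizer operates on UTF-8 bytes before applying BPE merges.
# An ASCII letter is 1 byte, but a CJK character is 3 bytes, so without
# a merge covering those bytes it can map to multiple tokens.
for ch in ["a", "中", "文"]:
    n_bytes = len(ch.encode("utf-8"))
    print(f"{ch!r}: {n_bytes} UTF-8 byte(s)")  # ASCII: 1 byte, CJK: 3 bytes
```

This is only an illustration of the byte-level encoding; the exact token count per character depends on which merges the GPT-2 vocabulary happens to contain.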

Fengshenbang-LM org

For Chinese, mapping a single word to a single token is more convenient. We used the original GPT2 tokenizer for some early training reasons, and we look forward to reusing the large model in the multilingual area. The model 'IDEA-CCNL/YuyuanQA-GPT2-3.5B' is based on it. We will soon release some GPT models that use a 'single word to single token' tokenizer, like BERT's.
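For context, a BERT-style Chinese tokenizer treats each CJK character as its own token, so tokenized length matches character count rather than byte count. A minimal hypothetical sketch (the `char_tokenize` function and toy vocab below are illustrative, not the actual released tokenizer):

```python
# Hypothetical sketch of character-level tokenization, BERT-style:
# every character maps to exactly one vocabulary id, so the token
# sequence is as long as the character sequence, never longer.
def char_tokenize(text, vocab):
    unk_id = len(vocab)  # out-of-vocabulary characters share one id
    return [vocab.get(ch, unk_id) for ch in text]

# Toy vocabulary built from four distinct characters.
vocab = {ch: i for i, ch in enumerate("中文模型")}
print(char_tokenize("中文模型", vocab))  # -> [0, 1, 2, 3]
```

With this scheme four characters always yield four tokens, whereas a byte-level BPE without CJK merges could yield up to twelve.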

Thanks for the detailed answer. Looking forward to the new GPT models.

ghosthamlet changed discussion status to closed
