The tokenizer problem

by ghosthamlet - opened

Thanks for open-sourcing this large GPT2 model.
I found that the tokenizer splits many single words into two or more tokens, so any tokenized text ends up double or more in length.
Why did you choose this tokenizer?
You can see the BPE tokenizer in the bloom model: it has a similar vocab size, but it tokenizes most single words into single tokens.
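The length blow-up is consistent with how GPT-2's byte-level BPE works: text is first converted to UTF-8 bytes and then merged, so a CJK character with no learned merge rule can become two or three tokens, roughly one per byte. A quick plain-Python check (no tokenizer library needed) shows the byte counts involved:

```python
# GPT-2's tokenizer operates on UTF-8 bytes before applying BPE merges.
# An ASCII letter is 1 byte, but a CJK character is 3 bytes, so without
# a merge covering those bytes it can map to multiple tokens.
for ch in ["a", "中", "文"]:
    n_bytes = len(ch.encode("utf-8"))
    print(f"{ch!r}: {n_bytes} UTF-8 byte(s)")  # ASCII: 1 byte, CJK: 3 bytes
```

This is only an illustration of the byte-level encoding; the exact token count per character depends on which merges the GPT-2 vocabulary happens to contain.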

Fengshenbang-LM org

For Chinese, mapping a single word to a single token is more convenient. We used the original GPT2 tokenizer for some early training reasons, and we look forward to reusing the large model in the multilingual area. The model 'IDEA-CCNL/YuyuanQA-GPT2-3.5B' is based on it. We will soon release some GPT models that use a 'single word to single token' tokenizer, like BERT's.
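For context, a BERT-style Chinese tokenizer treats each CJK character as its own token, so tokenized length matches character count rather than byte count. A minimal hypothetical sketch (the `char_tokenize` function and toy vocab below are illustrative, not the actual released tokenizer):

```python
# Hypothetical sketch of character-level tokenization, BERT-style:
# every character maps to exactly one vocabulary id, so the token
# sequence is as long as the character sequence, never longer.
def char_tokenize(text, vocab):
    unk_id = len(vocab)  # out-of-vocabulary characters share one id
    return [vocab.get(ch, unk_id) for ch in text]

# Toy vocabulary built from four distinct characters.
vocab = {ch: i for i, ch in enumerate("中文模型")}
print(char_tokenize("中文模型", vocab))  # -> [0, 1, 2, 3]
```

With this scheme four characters always yield four tokens, whereas a byte-level BPE without CJK merges could yield up to twelve.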

Thanks for the detailed answer. Looking forward to the new GPT models.

ghosthamlet changed discussion status to closed
