`CpmTokenizer` is different from the original CPM-1 tokenizer in GitHub

#1
by ShaneTian - opened

transformers.CpmTokenizer is based on transformers.XLNetTokenizer, but the original CPM-1 tokenizer is not.

I found the following while fine-tuning:

  • the original tokenizer always adds an eod_token = <eod> at the end of the sentence, see here.
  • transformers.CpmTokenizer always adds sep_token = <sep> and cls_token = <cls> at the end of the sentence, see here (checked in the sketch below).
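
For illustration, a quick way to confirm this difference, assuming the TsinghuaAI/CPM-Generate checkpoint on the Hub (where <eod> is registered only as an additional special token, so it is never appended automatically):

```python
# Sketch: inspect which special tokens CpmTokenizer appends when encoding.
# Assumes the TsinghuaAI/CPM-Generate checkpoint; requires jieba and sentencepiece.
from transformers import CpmTokenizer

tok = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")

ids = tok.encode("今天天气真好", add_special_tokens=True)
print(tok.convert_ids_to_tokens(ids))      # ends with "<sep>", "<cls>" (XLNet-style)
print(tok.convert_tokens_to_ids("<eod>"))  # <eod> has an id, but is not appended by default
```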

I am confused.
For LM fine-tuning, how should I prepare the input data?

  • [token_id_1, token_id_2, ..., eod_token_id], where eod_token_id is the id of the <eod> token in transformers.CpmTokenizer (sketched below)
  • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the </s> token in transformers.CpmTokenizer
  • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the <|endoftext|> token in transformers.GPT2Tokenizer
  • [token_id_1, token_id_2, ..., sep_token_id, cls_token_id], i.e. the output of just calling CpmTokenizer with its defaults
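
A minimal sketch of the first option (appending <eod> by hand, as the original CPM-1 repo does), next to the default CpmTokenizer output; the checkpoint name is an assumption:

```python
# Sketch: build option 1 ([..., eod_token_id]) manually and compare with the default output.
from transformers import CpmTokenizer

tok = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
eod_id = tok.convert_tokens_to_ids("<eod>")

text = "今天天气真好"
manual_ids = tok.encode(text, add_special_tokens=False) + [eod_id]  # option 1: ..., <eod>
default_ids = tok.encode(text, add_special_tokens=True)             # option 4: ..., <sep>, <cls>
```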

Wow, so sorry for the very late reply! You are right, we should probably correct the build_inputs_with_special_tokens function, which is used when you set add_special_tokens=True (to format the inputs).
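
Until that is fixed upstream, one possible workaround is a small subclass (the name CpmLMTokenizer is hypothetical) that overrides build_inputs_with_special_tokens so add_special_tokens=True appends <eod> instead of the XLNet-style <sep> <cls>; a sketch:

```python
# Sketch: make add_special_tokens=True append <eod> for LM fine-tuning.
from typing import List, Optional

from transformers import CpmTokenizer


class CpmLMTokenizer(CpmTokenizer):  # hypothetical subclass, not part of transformers
    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        eod_id = self.convert_tokens_to_ids("<eod>")
        if token_ids_1 is None:
            return token_ids_0 + [eod_id]
        return token_ids_0 + [eod_id] + token_ids_1 + [eod_id]


tok = CpmLMTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
print(tok("今天天气真好")["input_ids"])  # now ends with the <eod> id
```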

You can also change the template processor if you are using a fast tokenizer.
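
For the fast tokenizer path, a sketch of swapping the post-processor template so that a single sequence ends with <eod>; CpmTokenizerFast usage and the checkpoint name are assumptions here:

```python
# Sketch: replace the fast tokenizer's template post-processor to terminate with <eod>.
from transformers import CpmTokenizerFast
from tokenizers.processors import TemplateProcessing

tok = CpmTokenizerFast.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
eod_id = tok.convert_tokens_to_ids("<eod>")

tok.backend_tokenizer.post_processor = TemplateProcessing(
    single="$A <eod>",
    pair="$A <eod> $B <eod>",
    special_tokens=[("<eod>", eod_id)],
)
print(tok("今天天气真好")["input_ids"])  # ends with eod_id
```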

Thanks

ShaneTian changed discussion status to closed
