`CpmTokenizer` is different from the original CPM-1 tokenizer in GitHub

#1
by ShaneTian - opened

transformers.CpmTokenizer is based on transformers.XLNetTokenizer, but the original CPM-1 tokenizer is not.

I found the following while fine-tuning:

  • the original tokenizer always adds an eod_token = <eod> at the end of the sentence, see here.
  • transformers.CpmTokenizer always adds sep_token = <sep> and cls_token = <cls> at the end of the sentence, see here (checked in the sketch below).
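
For illustration, a quick way to confirm this difference, assuming the TsinghuaAI/CPM-Generate checkpoint on the Hub (where <eod> is registered only as an additional special token, so it is never appended automatically):

```python
# Sketch: inspect which special tokens CpmTokenizer appends when encoding.
# Assumes the TsinghuaAI/CPM-Generate checkpoint; requires jieba and sentencepiece.
from transformers import CpmTokenizer

tok = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")

ids = tok.encode("今天天气真好", add_special_tokens=True)
print(tok.convert_ids_to_tokens(ids))      # ends with "<sep>", "<cls>" (XLNet-style)
print(tok.convert_tokens_to_ids("<eod>"))  # <eod> has an id, but is not appended by default
```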

I am confused.
For LM fine-tuning, how should I prepare the input data?

  • [token_id_1, token_id_2, ..., eod_token_id], where eod_token_id is the id of the <eod> token in transformers.CpmTokenizer (sketched below)
  • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the </s> token in transformers.CpmTokenizer
  • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the <|endoftext|> token in transformers.GPT2Tokenizer
  • [token_id_1, token_id_2, ..., sep_token_id, cls_token_id], i.e. the output of just calling CpmTokenizer with its defaults
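
A minimal sketch of the first option (appending <eod> by hand, as the original CPM-1 repo does), next to the default CpmTokenizer output; the checkpoint name is an assumption:

```python
# Sketch: build option 1 ([..., eod_token_id]) manually and compare with the default output.
from transformers import CpmTokenizer

tok = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
eod_id = tok.convert_tokens_to_ids("<eod>")

text = "今天天气真好"
manual_ids = tok.encode(text, add_special_tokens=False) + [eod_id]  # option 1: ..., <eod>
default_ids = tok.encode(text, add_special_tokens=True)             # option 4: ..., <sep>, <cls>
```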

Wow, so sorry for the very late reply! You are right, we should probably correct the build_inputs_with_special_tokens function, which is used when you set add_special_tokens=True (to format the inputs).
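
Until that is fixed upstream, one possible workaround is a small subclass (the name CpmLMTokenizer is hypothetical) that overrides build_inputs_with_special_tokens so add_special_tokens=True appends <eod> instead of the XLNet-style <sep> <cls>; a sketch:

```python
# Sketch: make add_special_tokens=True append <eod> for LM fine-tuning.
from typing import List, Optional

from transformers import CpmTokenizer


class CpmLMTokenizer(CpmTokenizer):  # hypothetical subclass, not part of transformers
    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        eod_id = self.convert_tokens_to_ids("<eod>")
        if token_ids_1 is None:
            return token_ids_0 + [eod_id]
        return token_ids_0 + [eod_id] + token_ids_1 + [eod_id]


tok = CpmLMTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
print(tok("今天天气真好")["input_ids"])  # now ends with the <eod> id
```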

You can also change the template processor if you are using a fast tokenizer.
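
For the fast tokenizer path, a sketch of swapping the post-processor template so that a single sequence ends with <eod>; CpmTokenizerFast usage and the checkpoint name are assumptions here:

```python
# Sketch: replace the fast tokenizer's template post-processor to terminate with <eod>.
from transformers import CpmTokenizerFast
from tokenizers.processors import TemplateProcessing

tok = CpmTokenizerFast.from_pretrained("TsinghuaAI/CPM-Generate")  # assumed checkpoint
eod_id = tok.convert_tokens_to_ids("<eod>")

tok.backend_tokenizer.post_processor = TemplateProcessing(
    single="$A <eod>",
    pair="$A <eod> $B <eod>",
    special_tokens=[("<eod>", eod_id)],
)
print(tok("今天天气真好")["input_ids"])  # ends with eod_id
```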

Thanks

ShaneTian changed discussion status to closed
