`CpmTokenizer` is different from the original CPM-1 tokenizer in GitHub
#1
by ShaneTian - opened
`transformers.CpmTokenizer` is based on `transformers.XLNetTokenizer`, but the original CPM-1 tokenizer is not.
I found the following while fine-tuning:

- the original tokenizer always adds an `eod_token = <eod>` at the end of the sentence, see here.
- `transformers.CpmTokenizer` always adds `sep_token = <sep>` and `cls_token = <cls>` at the end of the sentence, see here.
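To make the difference concrete, here is a minimal sketch of the two output formats; the token ids below are made up for illustration and are not the real vocabulary ids of either tokenizer:

```python
# Hypothetical token ids for illustration only (not real vocabulary ids).
token_ids = [101, 102, 103]  # encoded sentence, no special tokens yet

# Original CPM-1 tokenizer: append <eod> at the end of the sentence.
eod_token_id = 7
cpm1_input = token_ids + [eod_token_id]

# transformers.CpmTokenizer (XLNet-style): append <sep> then <cls>.
sep_token_id, cls_token_id = 4, 3
hf_input = token_ids + [sep_token_id, cls_token_id]

print(cpm1_input)  # [101, 102, 103, 7]
print(hf_input)    # [101, 102, 103, 4, 3]
```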
I am confused. For LM fine-tuning, how should the input data be prepared?

- `[token_id_1, token_id_2, ..., eod_token_id]`, where `eod_token_id` is the id of the `<eod>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `</s>` token in `transformers.CpmTokenizer`
- `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `<|endoftext|>` token in `transformers.GPT2Tokenizer`
- `[token_id_1, token_id_2, ..., sep_token_id, cls_token_id]`, i.e. just call `CpmTokenizer` with its defaults
Wow, so sorry for the very late reply! You are right, we should probably correct the `build_inputs_with_special_tokens` function, which is used when you set `add_special_tokens=True` (to format the inputs).
You can also change the template processor if you are using a fast tokenizer.
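A hedged sketch of what such a correction could look like. The class below is a stand-in, not the real `transformers.CpmTokenizer`, and the `<eod>` id of 7 is made up; it only illustrates the shape of an overridden `build_inputs_with_special_tokens` that appends `<eod>` (as the original CPM-1 tokenizer does) instead of the XLNet-style `<sep><cls>` pair:

```python
# Sketch only: a stand-in class, NOT the real transformers.CpmTokenizer.
class CpmLikeTokenizer:
    """Illustrates overriding build_inputs_with_special_tokens so that
    <eod> is appended instead of the XLNet-style <sep> + <cls> pair."""

    def __init__(self, eod_token_id=7):  # 7 is a made-up id
        self.eod_token_id = eod_token_id

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        # Single sequence: tokens + <eod>
        if token_ids_1 is None:
            return token_ids_0 + [self.eod_token_id]
        # Sequence pair: A + <eod> + B + <eod>
        return token_ids_0 + [self.eod_token_id] + token_ids_1 + [self.eod_token_id]

print(CpmLikeTokenizer().build_inputs_with_special_tokens([101, 102]))
# [101, 102, 7]
```

In a real fix one would override the method on the actual tokenizer class (and, for a fast tokenizer, adjust its template processor accordingly, as noted above).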
Thanks
ShaneTian changed discussion status to closed