Are there any special tokens such as '<PERSON>' or '<LOC>' in the training set or fine-tuning set?
We fine-tuned BLOOMZ on a customized translation task, and it surprisingly works well on text masked with entity labels like '<PERSON>', '<LOC>', etc. However, when we slightly changed the labels to '<PERSON_id>' (to distinguish different entities), its performance dropped dramatically. Hence, we suspect that labels like '<PERSON>' receive some special treatment during pretraining or multi-task fine-tuning. Is our guess correct? If not, what could the possible reasons be? Thanks!
All special tokens of the model are here: https://huggingface.co/bigscience/bloomz/blob/main/special_tokens_map.json
& they do not include such tokens.
I imagine that strings like '<PERSON>' may naturally appear somewhere in the datasets, but we did not add them on purpose, at least not in the fine-tuning data.
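One way to verify this yourself is to inspect the tokenizer directly. The sketch below, assuming `transformers` is installed and the `bigscience/bloomz` tokenizer can be downloaded from the Hub, checks whether these labels are registered as special tokens and shows how they get split into subword pieces:

```python
# Minimal diagnostic sketch: how does the BLOOMZ tokenizer handle these labels?
# Assumes `transformers` is installed and the Hugging Face Hub is reachable.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloomz")

for text in ["<PERSON>", "<PERSON_id>"]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    pieces = tok.convert_ids_to_tokens(ids)
    print(f"{text!r} -> {pieces}")

# Neither label is in the registered special-token map, so both are encoded
# as ordinary subword sequences rather than mapped to a single reserved id.
print("<PERSON>" in tok.all_special_tokens)
print("<PERSON_id>" in tok.all_special_tokens)
```

If both labels are split into ordinary subwords, the performance gap likely comes from how often each string (or similar patterns) occurred in the training data, not from any reserved-token handling.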
That seems likely. Thanks for your reply!