Are there any special tokens formatted like '<PERSON>' or '<LOC>' in the training set or fine-tuning set?

#47
by tingxinli - opened

We fine-tuned BLOOMZ on a customized translation task, and it works surprisingly well on text masked with entity labels such as '<PERSON>' and '<LOC>'. However, when we slightly changed the labels to the form '<PERSON_id>' (to distinguish different entities), performance dropped dramatically. We therefore suspect that labels like '<PERSON>' receive some special treatment during pretraining or multi-task fine-tuning. Is our guess correct? If not, what could be the possible reasons? Thanks!

BigScience Workshop org

All of the model's special tokens are listed here: https://huggingface.co/bigscience/bloomz/blob/main/special_tokens_map.json, and they do not include such tokens.

I imagine that strings like <PERSON> may naturally appear somewhere in the datasets, but we did not add them on purpose, at least not to the fine-tuning data.
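
For anyone who wants to check this themselves, here is a minimal sketch that inspects the tokenizer directly. It assumes the `transformers` library is installed, and the labels compared are only illustrative; the exact splits are not guaranteed, so run it to see.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz")

# Registered special tokens (mirrors special_tokens_map.json);
# '<PERSON>'-style labels should not appear here.
print(tokenizer.special_tokens_map)

# Compare how a plain entity label and an id-suffixed variant are
# split into subword tokens.
for label in ["<PERSON>", "<LOC>", "<PERSON_1>"]:
    print(label, "->", tokenizer.tokenize(label))
```

If '<PERSON>' happens to split into fewer or more frequent pieces than an id-suffixed variant, the model would see a more stable pattern during fine-tuning, which could fit the performance gap described above.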

That seems likely. Thanks for your reply!

tingxinli changed discussion status to closed
