Issue with the NER task

#1
by neavo - opened

When training an NER task with this series of models, I ran into some issues.
After a normal training run, I obtained a model with fairly good evaluation metrics (e.g., F1 score).
In actual use, however, the base and small versions assigned very low scores (<0.5) to some non-contiguous characters, while the large version did not have this problem.
Based on your documentation, could this be caused by the difference between Unigram and BPE tokenization? How should I deal with this?

I think I've solved this. The problem was that SentencePiece may add a token consisting only of "_" at the beginning of a sentence. The offset mapping for that token was not the (0, 0) expected of a special token but (0, 1), which broke the normal operation of char_to_token. After correcting the alignment, the data looks normal, though it still needs further observation.
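For anyone hitting the same thing, here is a minimal self-contained sketch of the workaround described above. The token and offset data are simulated (no tokenizer is loaded), the function names are illustrative, and "▁" is the SentencePiece meta symbol U+2581 (often displayed as "_"): a bare "▁" token whose offset is (0, 1) instead of (0, 0) claims character 0 of the text, so a char_to_token-style lookup resolves to the meta token instead of the first real subword.

```python
# Sketch of the alignment fix, assuming a SentencePiece tokenizer that
# sometimes emits a bare "▁" token at the start of a sentence with an
# offset of (0, 1) instead of the (0, 0) expected for a meta token.

def fix_leading_meta_offsets(tokens, offsets):
    """Zero out the span of a bare '▁' meta token so it no longer
    claims character 0 of the input text."""
    fixed = list(offsets)
    for i, (tok, span) in enumerate(zip(tokens, offsets)):
        if tok == "\u2581" and span == (0, 1):
            fixed[i] = (0, 0)
    return fixed

def char_to_token(offsets, char_idx):
    """Minimal char->token lookup over an offset mapping; zero-width
    entries (special/meta tokens) never match and are skipped."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None

# Simulated tokenizer output for "Tokyo is big": the bare "▁" token
# wrongly covers character 0.
tokens  = ["[CLS]", "\u2581", "Tokyo", "\u2581is", "\u2581big", "[SEP]"]
offsets = [(0, 0), (0, 1), (0, 5), (5, 8), (8, 12), (0, 0)]

print(char_to_token(offsets, 0))  # resolves to the "▁" meta token (index 1)
fixed = fix_leading_meta_offsets(tokens, offsets)
print(char_to_token(fixed, 0))    # now resolves to "Tokyo" (index 2)
```

With the corrected mapping, character 0 aligns to the first real subword, so NER labels are no longer shifted by one token at the start of each sentence.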
