Whitespace as ORG
Hi,
Running xlm-roberta-base-ner-hrl with the following code, I get an unexpected ORG: a whitespace token. I would have expected "MZ" to be tokenized as "▁MZ", which would fix the issue. Or is this simply a false positive?
Does anybody have any advice?
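import pandas as pd
from transformers import pipeline
nlp = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl")  # assumed setup, not shown in the post
example = "Ein Jahr lang hat die MZ das Agrarunternehmen Barnstädt und die Agrargenossenschaft Bad Dürrenberg begleitet."
print(pd.DataFrame(nlp(example)), end="\n\n")
print(pd.DataFrame(nlp(example, aggregation_strategy="simple")))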
  entity     score  index   word  start  end
0  B-ORG  0.998993      6      ▁     21   22
1  B-ORG  0.674700      7     MZ     22   24
2  B-ORG  0.908934     12  ▁Barn     45   50
3  I-ORG  0.987022     13    stä     50   53
4  I-ORG  0.999098     14     dt     53   55
5  I-ORG  0.717301     22   ▁Bad     83   87
6  I-ORG  0.686630     23    ▁Dü     87   90
7  I-ORG  0.736514     24   rren     90   94
8  I-ORG  0.568195     25   berg     94   98

  entity_group     score            word  start  end
0          ORG  0.998993                     21   22
1          ORG  0.674700              MZ     22   24
2          ORG  0.965018       Barnstädt     45   55
3          ORG  0.677160  Bad Dürrenberg     83   98
Many thanks, Thomas
Hmm, this is a tokenization issue with the pipeline, especially with XLM-R. Can you try "Davlan/bert-base-multilingual-cased-ner-hrl" and check whether you see the same issue? There are ways of dealing with this, e.g. using the prediction function from the official NER code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py#L487
With "Davlan/bert-base-multilingual-cased-ner-hrl" all ORGs are as expected and the tagging format is IOB2 as well (nice):
   entity     score  index      word  start  end
0   B-ORG  0.999398      6         M     22   23
1   I-ORG  0.998075      7       ##Z     23   24
2   B-ORG  0.835974      9         A     29   30
3   I-ORG  0.544346     10     ##gra     30   33
4   I-ORG  0.837517     11    ##runt     33   37
5   I-ORG  0.894039     12    ##erne     37   41
6   I-ORG  0.942287     13    ##hmen     41   45
7   I-ORG  0.854009     14      Barn     46   50
8   I-ORG  0.999464     15     ##stä     50   53
9   I-ORG  0.999405     16      ##dt     53   55
10  B-ORG  0.859655     19         A     64   65
11  I-ORG  0.666555     20     ##gra     65   68
12  I-ORG  0.837543     21    ##rgen     68   72
13  I-ORG  0.873205     22   ##ossen     72   77
14  I-ORG  0.895725     23  ##schaft     77   83
15  I-ORG  0.915099     24       Bad     84   87
16  I-ORG  0.997494     25         D     88   89
17  I-ORG  0.996793     26      ##ür     89   91
18  I-ORG  0.997110     27     ##ren     91   94
19  I-ORG  0.999065     28    ##berg     94   98

  entity_group     score                                word  start  end
0          ORG  0.998736                                  MZ     22   24
1          ORG  0.863380          Agrarunternehmen Barnstädt     29   55
2          ORG  0.903824  Agrargenossenschaft Bad Dürrenberg     64   98
On the other hand, the XLM-R based model would be the preferred one... Is there an easy way to achieve this? Thanks!
Oh I see. Maybe you can use my prediction function, but you will need to modify it a bit (tag set, model type, model name, etc.): https://github.com/masakhane-io/masakhane-ner/blob/main/code/predict_ner.py
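If adapting a separate prediction script is too much, a lighter workaround might be to post-process the pipeline output and drop predictions whose text is empty or consists only of the SentencePiece marker "▁". The sketch below is only an assumption about what would be acceptable here: the helper name drop_whitespace_entities is hypothetical (it is not part of transformers), and the sample records are copied from the aggregated XLM-R output earlier in this thread.

```python
SPIECE_MARKER = "\u2581"  # the SentencePiece word-boundary marker, shown as "▁"


def drop_whitespace_entities(entities):
    """Keep only predictions whose 'word' has visible characters.

    Hypothetical helper: works on the list of dicts returned by the
    token-classification pipeline, with or without aggregation.
    """
    cleaned = []
    for ent in entities:
        # Strip the marker and surrounding whitespace before checking.
        text = ent.get("word", "").replace(SPIECE_MARKER, "").strip()
        if text:
            cleaned.append(ent)
    return cleaned


# Records copied from the aggregated XLM-R output posted above.
aggregated = [
    {"entity_group": "ORG", "score": 0.998993, "word": "", "start": 21, "end": 22},
    {"entity_group": "ORG", "score": 0.674700, "word": "MZ", "start": 22, "end": 24},
    {"entity_group": "ORG", "score": 0.965018, "word": "Barnstädt", "start": 45, "end": 55},
    {"entity_group": "ORG", "score": 0.677160, "word": "Bad Dürrenberg", "start": 83, "end": 98},
]

filtered = drop_whitespace_entities(aggregated)
for ent in filtered:
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])
```

With a live pipeline this would be called as drop_whitespace_entities(nlp(example, aggregation_strategy="simple")). It only removes the stray whitespace span; it does not recover the entities that the XLM-R model misses compared with the mBERT model.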
The model cannot recognize the "DATE" entity, which is in the config.