Whitespace as ORG

#1
by TJC - opened

Hi,

Running xlm-roberta-base-ner-hrl with the following code, I get an unexpected ORG entity that consists only of a whitespace. I would have expected "MZ" to be tokenized as "▁MZ", which would fix the issue. Or is this simply a false positive?

Does anybody have advice?

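import pandas as pd
from transformers import pipeline
nlp = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl")  # token-classification pipeline
example = "Ein Jahr lang hat die MZ das Agrarunternehmen Barnstädt und die Agrargenossenschaft Bad Dürrenberg begleitet."
print(pd.DataFrame(nlp(example)), end="\n\n")  # raw per-token predictions
print(pd.DataFrame(nlp(example, aggregation_strategy="simple")))  # grouped entities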
  entity     score  index   word  start  end
0  B-ORG  0.998993      6      ▁     21   22
1  B-ORG  0.674700      7     MZ     22   24
2  B-ORG  0.908934     12  ▁Barn     45   50
3  I-ORG  0.987022     13    stä     50   53
4  I-ORG  0.999098     14     dt     53   55
5  I-ORG  0.717301     22   ▁Bad     83   87
6  I-ORG  0.686630     23    ▁Dü     87   90
7  I-ORG  0.736514     24   rren     90   94
8  I-ORG  0.568195     25   berg     94   98

  entity_group     score            word  start  end
0          ORG  0.998993                     21   22
1          ORG  0.674700              MZ     22   24
2          ORG  0.965018       Barnstädt     45   55
3          ORG  0.677160  Bad Dürrenberg     83   98

Many thanks, Thomas

Hmm, this is a tokenization issue with the pipeline, especially with XLM-R. Can you try "Davlan/bert-base-multilingual-cased-ner-hrl" and check whether you see the same issue? There are ways of dealing with this, e.g. using the prediction function of the official NER example code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py#L487
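
The idea behind that word-level prediction can be sketched roughly as follows: the text is split into words first, and each word only gets the label of its first sub-token, so a standalone "▁" piece never becomes an entity of its own. This is a minimal sketch, not the actual code of run_ner.py. The helper name `word_level_labels` is hypothetical, and the `word_ids` and labels below are written by hand for illustration; in practice `word_ids` would come from a fast tokenizer's `encoding.word_ids()`.

```python
def word_level_labels(word_ids, token_labels):
    """Keep only the label of the first sub-token of each word.

    word_ids: one entry per token, None for special tokens.
    token_labels: predicted label per token, same length as word_ids.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:       # special tokens such as <s> and </s>
            continue
        if word_id != previous:   # first sub-token of a new word
            labels.append(label)
        previous = word_id
    return labels


# Hand-written illustration for the words ["die", "MZ", "das"],
# assuming "MZ" is split into the two pieces "▁" and "MZ".
tokens = ["<s>", "▁die", "▁", "MZ", "▁das", "</s>"]
word_ids = [None, 0, 1, 1, 2, None]
token_labels = ["O", "O", "B-ORG", "B-ORG", "O", "O"]

print(word_level_labels(word_ids, token_labels))  # ['O', 'B-ORG', 'O']
```

With this alignment, "▁" and "MZ" count as one word, so only a single B-ORG is reported for "MZ".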

With "Davlan/bert-base-multilingual-cased-ner-hrl" all ORGs are as expected and the tagging format is IOB2 as well (nice):

   entity     score  index      word  start  end
0   B-ORG  0.999398      6         M     22   23
1   I-ORG  0.998075      7       ##Z     23   24
2   B-ORG  0.835974      9         A     29   30
3   I-ORG  0.544346     10     ##gra     30   33
4   I-ORG  0.837517     11    ##runt     33   37
5   I-ORG  0.894039     12    ##erne     37   41
6   I-ORG  0.942287     13    ##hmen     41   45
7   I-ORG  0.854009     14      Barn     46   50
8   I-ORG  0.999464     15     ##stä     50   53
9   I-ORG  0.999405     16      ##dt     53   55
10  B-ORG  0.859655     19         A     64   65
11  I-ORG  0.666555     20     ##gra     65   68
12  I-ORG  0.837543     21    ##rgen     68   72
13  I-ORG  0.873205     22   ##ossen     72   77
14  I-ORG  0.895725     23  ##schaft     77   83
15  I-ORG  0.915099     24       Bad     84   87
16  I-ORG  0.997494     25         D     88   89
17  I-ORG  0.996793     26      ##ür     89   91
18  I-ORG  0.997110     27     ##ren     91   94
19  I-ORG  0.999065     28    ##berg     94   98

  entity_group     score                                word  start  end
0          ORG  0.998736                                  MZ     22   24
1          ORG  0.863380          Agrarunternehmen Barnstädt     29   55
2          ORG  0.903824  Agrargenossenschaft Bad Dürrenberg     64   98

On the other hand, the XLM-R based model would be the preferred one... Is there an easy way to achieve this? Thx!

Oh I see, maybe you can use my prediction function, but you need to modify it a bit (tag set, model type, model name, etc.): https://github.com/masakhane-io/masakhane-ner/blob/main/code/predict_ner.py
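
If the pipeline output is otherwise fine, a lighter workaround might be to drop every entity whose character span covers only whitespace. The following is a minimal sketch under that assumption: the helper name `drop_whitespace_entities` is hypothetical, and the entity list is copied by hand from the aggregated XLM-R output earlier in this thread (the empty "word" of the first entry is an assumption about how the whitespace entity is represented).

```python
def drop_whitespace_entities(entities, text):
    """Remove entities whose [start, end) span in text is only whitespace."""
    return [e for e in entities if text[e["start"]:e["end"]].strip()]


example = ("Ein Jahr lang hat die MZ das Agrarunternehmen Barnstädt "
           "und die Agrargenossenschaft Bad Dürrenberg begleitet.")

# Aggregated pipeline output as reported above (scores omitted).
entities = [
    {"entity_group": "ORG", "word": "", "start": 21, "end": 22},
    {"entity_group": "ORG", "word": "MZ", "start": 22, "end": 24},
    {"entity_group": "ORG", "word": "Barnstädt", "start": 45, "end": 55},
    {"entity_group": "ORG", "word": "Bad Dürrenberg", "start": 83, "end": 98},
]

cleaned = drop_whitespace_entities(entities, example)
print([e["word"] for e in cleaned])  # ['MZ', 'Barnstädt', 'Bad Dürrenberg']
```

This only hides the spurious entity; it does not change how the model tokenizes or scores "MZ".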

The model cannot recognize the "DATE" entity, which is in the config.
