Whitespace as ORG

by TJC - opened Jun 23, 2022

TJC

Jun 23, 2022

Hi,

Running xlm-roberta-base-ner-hrl with the following code, I get an unexpected ORG entity that consists only of a whitespace. I would have expected "MZ" to be tokenized as "▁MZ", which would fix the issue. Or is this simply a false positive?

Does anybody have any advice?

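import pandas as pd
from transformers import pipeline
nlp = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl")  # assumed setup, not shown in the original post
example = "Ein Jahr lang hat die MZ das Agrarunternehmen Barnstädt und die Agrargenossenschaft Bad Dürrenberg begleitet."
print(pd.DataFrame(nlp(example)), end="\n\n")
print(pd.DataFrame(nlp(example, aggregation_strategy="simple")))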

  entity     score  index   word  start  end
0  B-ORG  0.998993      6      ▁     21   22
1  B-ORG  0.674700      7     MZ     22   24
2  B-ORG  0.908934     12  ▁Barn     45   50
3  I-ORG  0.987022     13    stä     50   53
4  I-ORG  0.999098     14     dt     53   55
5  I-ORG  0.717301     22   ▁Bad     83   87
6  I-ORG  0.686630     23    ▁Dü     87   90
7  I-ORG  0.736514     24   rren     90   94
8  I-ORG  0.568195     25   berg     94   98

  entity_group     score            word  start  end
0          ORG  0.998993                     21   22
1          ORG  0.674700              MZ     22   24
2          ORG  0.965018       Barnstädt     45   55
3          ORG  0.677160  Bad Dürrenberg     83   98

Many thanks, Thomas

Davlan

Owner Jun 25, 2022

Hmm, this is a tokenization issue with the pipeline, especially with XLM-R. Can you try "Davlan/bert-base-multilingual-cased-ner-hrl" and check whether you see the same issue? There are ways of dealing with this, e.g. using the prediction function from the official NER example code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py#L487
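As a quick workaround on the pipeline side, one option is to post-process the aggregated output and drop entity groups whose text is empty. This is only a minimal sketch, not part of the pipeline or of the linked script: the helper name `drop_blank_entities` is made up for illustration, and the input values are copied from the aggregated output in the first post.

```python
from typing import Dict, List


def drop_blank_entities(groups: List[Dict]) -> List[Dict]:
    """Remove entity groups whose text is empty or whitespace only.

    Hypothetical helper: it filters the list returned by a pipeline call
    with aggregation_strategy="simple" and does not touch the model.
    """
    cleaned = []
    for group in groups:
        # SentencePiece marks a word start with "▁" (U+2581). When that marker
        # is split off as its own token, the aggregated word is left blank.
        if group["word"].replace("\u2581", "").strip() == "":
            continue
        cleaned.append(group)
    return cleaned


# Values copied from the aggregated output in the first post.
groups = [
    {"entity_group": "ORG", "score": 0.998993, "word": "", "start": 21, "end": 22},
    {"entity_group": "ORG", "score": 0.674700, "word": "MZ", "start": 22, "end": 24},
    {"entity_group": "ORG", "score": 0.965018, "word": "Barnstädt", "start": 45, "end": 55},
    {"entity_group": "ORG", "score": 0.677160, "word": "Bad Dürrenberg", "start": 83, "end": 98},
]

for group in drop_blank_entities(groups):
    print(group["entity_group"], group["word"], group["start"], group["end"])
```

Note that this only hides the blank span. The remaining "MZ" group keeps its own, lower score, because the high-confidence prediction was attached to the separate "▁" token.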

TJC

Jul 27, 2022

With "Davlan/bert-base-multilingual-cased-ner-hrl" all ORGs are as expected and the tagging format is IOB2 as well (nice):

   entity     score  index      word  start  end
0   B-ORG  0.999398      6         M     22   23
1   I-ORG  0.998075      7       ##Z     23   24
2   B-ORG  0.835974      9         A     29   30
3   I-ORG  0.544346     10     ##gra     30   33
4   I-ORG  0.837517     11    ##runt     33   37
5   I-ORG  0.894039     12    ##erne     37   41
6   I-ORG  0.942287     13    ##hmen     41   45
7   I-ORG  0.854009     14      Barn     46   50
8   I-ORG  0.999464     15     ##stä     50   53
9   I-ORG  0.999405     16      ##dt     53   55
10  B-ORG  0.859655     19         A     64   65
11  I-ORG  0.666555     20     ##gra     65   68
12  I-ORG  0.837543     21    ##rgen     68   72
13  I-ORG  0.873205     22   ##ossen     72   77
14  I-ORG  0.895725     23  ##schaft     77   83
15  I-ORG  0.915099     24       Bad     84   87
16  I-ORG  0.997494     25         D     88   89
17  I-ORG  0.996793     26      ##ür     89   91
18  I-ORG  0.997110     27     ##ren     91   94
19  I-ORG  0.999065     28    ##berg     94   98

  entity_group     score                                word  start  end
0          ORG  0.998736                                  MZ     22   24
1          ORG  0.863380          Agrarunternehmen Barnstädt     29   55
2          ORG  0.903824  Agrargenossenschaft Bad Dürrenberg     64   98

On the other hand, the XLM-R based model would be the preferred one... Is there an easy way to achieve this? - Thx!

Davlan

Owner Jul 27, 2022

Oh I see, maybe you can use my prediction function, but you need to modify it a bit (tag set, model type, model name, etc.): https://github.com/masakhane-io/masakhane-ner/blob/main/code/predict_ner.py
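For readers who want to see the general idea behind word-level prediction, here is a small sketch of one common approach: give each word the label predicted for its first subtoken. This is not necessarily what the linked script does. The function name `first_subtoken_labels` is made up, and the `word_ids` list is written by hand here; in practice it is assumed to come from a fast tokenizer's `word_ids()` after tokenizing pre-split words.

```python
from typing import List, Optional


def first_subtoken_labels(
    word_ids: List[Optional[int]], token_labels: List[str]
) -> List[str]:
    """Return one label per word, taken from the word's first subtoken.

    word_ids maps each token to the index of its source word, with None
    for special tokens such as <s> and </s>.
    """
    word_labels: List[str] = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        # Skip special tokens, and skip subtokens that continue the same word.
        if word_id is not None and word_id != previous:
            word_labels.append(label)
        previous = word_id
    return word_labels


# Hand-written illustration for the words "die", "MZ", "das".
# "MZ" is assumed to be split into "▁" and "MZ", as in the first post.
words = ["die", "MZ", "das"]
word_ids = [None, 0, 1, 1, 2, None]
token_labels = ["O", "O", "B-ORG", "B-ORG", "O", "O"]

for word, label in zip(words, first_subtoken_labels(word_ids, token_labels)):
    print(word, label)
```

Because "▁" and "MZ" share the same word index, they collapse into a single word-level prediction instead of two separate ORG spans.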

RoacherM

Feb 1, 2023

The model cannot recognize the "DATE" entity, which is listed in the config.
