The [MEDDOCAN dataset](https://github.com/PlanTL-GOB-ES/SPACCC_MEDDOCAN) contains some entities that are separated by a dot rather than a space. For example, `Alicante.Villajoyosa` is two separate entities, but traditional tokenizers treat it as a single token. spaCy's tokenizers do not handle this either: when I was trying to align the entities to tokens while training spaCy v3, it frequently reported errors that it could not match some entities to tokens because of this problem.
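To illustrate the problem, a plain whitespace split leaves the two entities fused (the surrounding sentence here is a made-up example; only `Alicante.Villajoyosa` comes from the dataset):

```python
# A whitespace tokenizer cannot separate dot-joined entities:
text = "Paciente trasladado de Alicante.Villajoyosa"
tokens = text.split()
print(tokens[-1])  # → "Alicante.Villajoyosa" — still one token
```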

That is why I have created a tokenizer with manual regex rules, which improves performance when using the model:

```
from flair.models import SequenceTagger
from flair.data import Sentence
from flair.data import Tokenizer
import re

class CustomTokenizer(Tokenizer):
    def tokenize(self, text):
        finaltokens = []
        for token in text.split():
            # split each whitespace token on hyphens and slashes, dropping empty pieces
            for i in filter(None, re.split(r"[-/]", token)):
                # split tokens such as "Alicante.Villajoyosa" on dots that join two words
                if re.search(r"\w\.\w", i):
                    for j in filter(None, i.split(".")):
                        finaltokens.append(j)
                else:
                    finaltokens.append(i)
        return finaltokens

flairTagger = SequenceTagger.load("rjuez00/meddocan-flair-spanish-fast-bilstm-crf")
```
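The splitting rules themselves do not depend on Flair, so they can be sanity-checked in isolation. The helper below is a hypothetical standalone copy of the same logic (`split_token` is not part of the model or of Flair):

```python
import re

def split_token(token):
    """Split one whitespace-delimited token with the same rules as the tokenizer above."""
    pieces = []
    for i in filter(None, re.split(r"[-/]", token)):
        if re.search(r"\w\.\w", i):  # a dot with word characters on both sides
            pieces.extend(filter(None, i.split(".")))
        else:
            pieces.append(i)
    return pieces

print(split_token("Alicante.Villajoyosa"))  # → ['Alicante', 'Villajoyosa']
print(split_token("Dr."))                   # → ['Dr.'] — trailing dots are kept
```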


To use the model, instantiate it as above and then create a Flair `Sentence` with the text and the tokenizer, like this:

```documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())```


Unfortunately, the spans that Flair produces while performing NER on the MEDDOCAN dataset are not correct. I am not sure whether this is a bug in my version (0.11), but I have developed a system that corrects the slight deviations of the offsets.


```
documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())
flairTagger.predict(documentFlair)

predictedEntities = []
for entity in documentFlair.get_spans("ner"):
    predictedEntities.append(entity)
```

```
f = open("output.ann", "w")  # assumed output filename; entities are written in BRAT .ann format

for idxentity, entity in enumerate(reversed(predictedEntities), start = 1):
    entityType = entity.get_label("ner").value
    startEntity = entity.start_position
    endEntity = entity.end_position

    # move the start forward past leading punctuation and whitespace
    while text[startEntity] in [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]:
        startEntity += 1

    # extend the end while it still points at an alphanumeric character (cut-off word)
    while len(text) > endEntity and (text[endEntity].isalpha() or text[endEntity].isnumeric()):
        endEntity += 1

    # move the end back past trailing punctuation
    while text[endEntity-1] in [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'", "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]:
        endEntity -= 1

    f.write("T" + str(idxentity) + "\t"
            + entityType + " " + str(startEntity) + " " + str(endEntity)
            + "\t" + text[startEntity:endEntity] + "\n")

f.close()
```
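The correction logic can also be checked on its own. The function below is a hypothetical self-contained version of the three trimming loops above (`fix_offsets` and the example span are not part of the model):

```python
PUNCT_START = [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]
PUNCT_END = [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'",
             "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]

def fix_offsets(text, start, end):
    """Re-anchor a predicted (start, end) span on clean word boundaries."""
    while text[start] in PUNCT_START:               # skip leading punctuation/whitespace
        start += 1
    while end < len(text) and text[end].isalnum():  # extend over a cut-off word
        end += 1
    while text[end - 1] in PUNCT_END:               # drop trailing punctuation
        end -= 1
    return start, end

# a span predicted as ".Villajoy" inside the text gets widened and trimmed:
text = "de Alicante.Villajoyosa, zona"
print(fix_offsets(text, 11, 20))  # → (12, 23), i.e. text[12:23] == "Villajoyosa"
```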