rjuez00/meddocan-flair-spanish-fast-bilstm-crf

In the MEDDOCAN dataset, some entities are separated by a dot rather than a space. For example, Alicante.Villajoyosa contains two separate entities, but traditional tokenizers treat it as a single token. The spaCy tokenizers have the same problem: when I tried to assign the entities to the tokens during training, spaCy v3 frequently reported that it could not match some entities to tokens.

That is why I created a tokenizer with manual regex rules, which improves performance when using the model:

from flair.models import SequenceTagger
from flair.data import Sentence
from flair.data import Tokenizer
import re

class CustomTokenizer(Tokenizer):
    def tokenize(self, text):
        finaltokens = []
        # Split on whitespace first, then on "-" and "/" inside each chunk.
        for token in text.split():
            for piece in filter(None, re.split(r"-|/", token)):
                # A dot between two word characters (e.g. Alicante.Villajoyosa)
                # marks a token boundary; otherwise the piece is kept as is.
                if re.search(r"\w\.\w", piece):
                    finaltokens.extend(filter(None, piece.split(".")))
                else:
                    finaltokens.append(piece)
        return finaltokens

flairTagger = SequenceTagger.load("rjuez00/meddocan-flair-spanish-fast-bilstm-crf")
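
To check what these rules do without installing Flair or downloading the model, the splitting logic can be sketched as a plain function. The name `split_like_custom_tokenizer` and the sample sentence are made up for illustration; the function is only meant to mirror the rules of `CustomTokenizer.tokenize` above.

```python
import re

def split_like_custom_tokenizer(text):
    """Hypothetical stand-alone copy of the CustomTokenizer rules."""
    tokens = []
    for chunk in text.split():
        # Split each whitespace-delimited chunk on "-" and "/".
        for piece in filter(None, re.split(r"-|/", chunk)):
            # A dot surrounded by word characters separates two tokens.
            if re.search(r"\w\.\w", piece):
                tokens.extend(filter(None, piece.split(".")))
            else:
                tokens.append(piece)
    return tokens

sample = "Nacido en Alicante.Villajoyosa el 12/03/2019."
print(split_like_custom_tokenizer(sample))
```

Note that a dot at the end of a chunk (as in the final date above) is not between two word characters, so it stays attached to its token.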

To use the model, instantiate it as shown above, then create a Flair Sentence from the text and the tokenizer like this: documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())

Unfortunately, the spans that Flair provides when performing NER on the MEDDOCAN dataset are not correct. I don't know whether this is a bug in my version (0.11), but I have developed a procedure that corrects the slight deviations in the offsets:

documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())
flairTagger.predict(documentFlair)

# Collect the predicted entity spans.
predictedEntities = list(documentFlair.get_spans("ner"))

# f is assumed to be an open, writable file handle for the annotation output.
for idxentity, entity in enumerate(reversed(predictedEntities), start = 1):
    entityType = entity.get_label("ner").value
    startEntity = entity.start_position
    endEntity = entity.end_position

    # Skip leading whitespace and punctuation.
    while startEntity < endEntity and text[startEntity] in [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]:
        startEntity += 1

    # Extend the end if the span stops in the middle of a word or number.
    while len(text) > endEntity and (text[endEntity].isalpha() or text[endEntity].isnumeric()):
        endEntity += 1

    # Drop trailing whitespace and punctuation.
    while endEntity > startEntity and text[endEntity-1] in [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'", "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]:
        endEntity -= 1

    # One line per entity: ID, type with corrected offsets, and the entity text.
    f.write("T" + str(idxentity) + "\t"
            + entityType + " " + str(startEntity) + " " + str(endEntity)
            + "\t" + text[startEntity:endEntity] + "\n")
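
The offset correction can also be tried on its own, without Flair, by isolating it in a small helper. This is only a sketch: `fix_offsets`, the two character constants, and the sample text are hypothetical names and data introduced here, and the bounds checks are an added safeguard rather than part of the original loop.

```python
# Characters skipped at the start and trimmed at the end of a span,
# mirroring the lists used in the correction loop above.
LEADING = " (),.;:!?-\n"
TRAILING = " ,.;:!?-)(\\/\"'+*&%$#@~`^|=><]"

def fix_offsets(text, start, end):
    """Hypothetical helper: return corrected (start, end) offsets."""
    # Skip leading whitespace and punctuation.
    while start < end and text[start] in LEADING:
        start += 1
    # Extend the end if the span stops inside a word or number.
    while end < len(text) and (text[end].isalpha() or text[end].isnumeric()):
        end += 1
    # Trim trailing whitespace and punctuation.
    while end > start and text[end - 1] in TRAILING:
        end -= 1
    return start, end

text = "Nombre: Juan Perez, edad 45."
# A span that starts one character early and includes the comma.
start, end = fix_offsets(text, 7, 19)
print(start, end, repr(text[start:end]))
# A span that was cut off in the middle of the surname.
start, end = fix_offsets(text, 8, 16)
print(start, end, repr(text[start:end]))
```

Both calls should settle on the same span, which is the point of the correction: small deviations at either end are absorbed before the offsets are written out.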