Commit a354058 by rjuez00 (parent 224ab0f): Create README.md
The [MEDDOCAN dataset](https://github.com/PlanTL-GOB-ES/SPACCC_MEDDOCAN) contains entities that are separated by a dot rather than a space. For example, `Alicante.Villajoyosa` is two separate entities, but traditional tokenizers treat it as a single token. spaCy's tokenizers don't handle this either: while assigning entities to tokens during training, spaCy v3 frequently reported errors that it could not match some entities to tokens because of this problem.

That is why I have created a tokenizer with manual regex rules, which improves performance when using the model:

```python
from flair.models import SequenceTagger
from flair.data import Sentence
from flair.data import Tokenizer
import re


class CustomTokenizer(Tokenizer):
    def tokenize(self, text):
        finaltokens = []
        for token in text.split():
            # Split each whitespace-separated token on hyphens and slashes
            for part in filter(None, re.split(r"-|/", token)):
                # If the chunk still contains a word.word pattern
                # (e.g. "Alicante.Villajoyosa"), split on the dot as well
                if re.search(r"\w\.\w", part):
                    for piece in filter(None, part.split(".")):
                        finaltokens.append(piece)
                else:
                    finaltokens.append(part)
        return finaltokens


flairTagger = SequenceTagger.load("rjuez00/meddocan-flair-spanish-fast-bilstm-crf")
```
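
For a quick sanity check, the splitting logic can be reproduced without the flair dependency. This is just a sketch of the same rules; the function name below is mine, not part of the model:

```python
import re


def split_like_custom_tokenizer(text):
    """Replicates the CustomTokenizer splitting rules standalone."""
    finaltokens = []
    for token in text.split():
        # Split on hyphens and slashes first
        for part in filter(None, re.split(r"-|/", token)):
            # A word.word pattern is split on the dot as well
            if re.search(r"\w\.\w", part):
                finaltokens.extend(filter(None, part.split(".")))
            else:
                finaltokens.append(part)
    return finaltokens


print(split_like_custom_tokenizer("Alicante.Villajoyosa"))  # ['Alicante', 'Villajoyosa']
```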

To use the model, instantiate it as above and then create a Flair `Sentence` with the text and the custom tokenizer:

```python
documentFlair = Sentence(text, use_tokenizer=CustomTokenizer())
```

Unfortunately, the spans that Flair provides when performing NER on the MEDDOCAN dataset are not correct. I'm not aware whether it's a bug in my version (0.11), but I've developed a post-processing step that corrects the slight deviations in the offsets.

```python
documentFlair = Sentence(text, use_tokenizer=CustomTokenizer())
flairTagger.predict(documentFlair)

# Collect the predicted NER spans
predictedEntities = []
for entity in documentFlair.get_spans("ner"):
    predictedEntities.append(entity)
```

```python
# f is an open file handle for the output annotations (brat-style .ann format)
for idxentity, entity in enumerate(reversed(predictedEntities), start=1):
    entityType = entity.get_label("ner").value
    startEntity = entity.start_position
    endEntity = entity.end_position

    # Move the start forward past any leading whitespace/punctuation
    while text[startEntity] in [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]:
        startEntity += 1

    # Extend the end while it still points inside an alphanumeric word
    while len(text) > endEntity and (text[endEntity].isalpha() or text[endEntity].isnumeric()):
        endEntity += 1

    # Move the end back past any trailing whitespace/punctuation
    while text[endEntity - 1] in [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'", "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]:
        endEntity -= 1

    f.write("T" + str(idxentity) + "\t"
            + entityType + " " + str(startEntity) + " " + str(endEntity)
            + "\t" + text[startEntity:endEntity] + "\n")
```
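
The trimming loops above can be checked in isolation. A minimal sketch with a toy sentence and hypothetical, deliberately off-by-one offsets (not real tagger output):

```python
text = "Paciente visto en Alicante.Villajoyosa en 2021"
# Hypothetical span covering " Alicante" -- the start is one character too early
startEntity, endEntity = 17, 26

# Move the start forward past leading whitespace/punctuation
while text[startEntity] in [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]:
    startEntity += 1

# Extend the end while it points inside an alphanumeric word
while len(text) > endEntity and (text[endEntity].isalpha() or text[endEntity].isnumeric()):
    endEntity += 1

print(text[startEntity:endEntity])  # Alicante
```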