The [MEDDOCAN dataset](https://github.com/PlanTL-GOB-ES/SPACCC_MEDDOCAN) contains some entities that are separated by a dot rather than a space. For example, `Alicante.Villajoyosa` is two separate entities, but traditional tokenizers treat it as a single token. spaCy's tokenizers do not handle this either: when I was trying to align the entities to tokens while training spaCy v3, it frequently reported errors that it could not match some entities to tokens because of this problem.
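To illustrate the problem, a plain whitespace split leaves the two entities fused (the surrounding sentence here is a made-up example; only `Alicante.Villajoyosa` comes from the dataset):

```python
# A whitespace tokenizer cannot separate dot-joined entities:
text = "Paciente trasladado de Alicante.Villajoyosa"
tokens = text.split()
print(tokens[-1])  # → "Alicante.Villajoyosa" — still one token
```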

That is why I have created a tokenizer with manual regex rules, which improves performance when using the model:

```
from flair.models import SequenceTagger
from flair.data import Sentence
from flair.data import Tokenizer
import re

class CustomTokenizer(Tokenizer):
    def tokenize(self, text):
        finaltokens = []
        for token in text.split():
            # split each whitespace token on hyphens and slashes, dropping empty pieces
            for i in filter(None, re.split(r"[-/]", token)):
                # split tokens such as "Alicante.Villajoyosa" on dots that join two words
                if re.search(r"\w\.\w", i):
                    for j in filter(None, i.split(".")):
                        finaltokens.append(j)
                else:
                    finaltokens.append(i)
        return finaltokens

flairTagger = SequenceTagger.load("rjuez00/meddocan-flair-spanish-fast-bilstm-crf")
```
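The splitting rules themselves do not depend on Flair, so they can be sanity-checked in isolation. The helper below is a hypothetical standalone copy of the same logic (`split_token` is not part of the model or of Flair):

```python
import re

def split_token(token):
    """Split one whitespace-delimited token with the same rules as the tokenizer above."""
    pieces = []
    for i in filter(None, re.split(r"[-/]", token)):
        if re.search(r"\w\.\w", i):  # a dot with word characters on both sides
            pieces.extend(filter(None, i.split(".")))
        else:
            pieces.append(i)
    return pieces

print(split_token("Alicante.Villajoyosa"))  # → ['Alicante', 'Villajoyosa']
print(split_token("Dr."))                   # → ['Dr.'] — trailing dots are kept
```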


To use the model, instantiate it as above and then create a Flair `Sentence` with the text and the tokenizer, like this:

```documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())```


Unfortunately, the spans that Flair produces while performing NER on the MEDDOCAN dataset are not correct. I am not sure whether this is a bug in my version (0.11), but I have developed a system that corrects the slight deviations of the offsets.


```
documentFlair = Sentence(text, use_tokenizer = CustomTokenizer())
flairTagger.predict(documentFlair)

predictedEntities = []
for entity in documentFlair.get_spans("ner"):
    predictedEntities.append(entity)
```

```
f = open("output.ann", "w")  # assumed output filename; entities are written in BRAT .ann format

for idxentity, entity in enumerate(reversed(predictedEntities), start = 1):
    entityType = entity.get_label("ner").value
    startEntity = entity.start_position
    endEntity = entity.end_position

    # move the start forward past leading punctuation and whitespace
    while text[startEntity] in [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]:
        startEntity += 1

    # extend the end while it still points at an alphanumeric character (cut-off word)
    while len(text) > endEntity and (text[endEntity].isalpha() or text[endEntity].isnumeric()):
        endEntity += 1

    # move the end back past trailing punctuation
    while text[endEntity-1] in [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'", "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]:
        endEntity -= 1

    f.write("T" + str(idxentity) + "\t"
            + entityType + " " + str(startEntity) + " " + str(endEntity)
            + "\t" + text[startEntity:endEntity] + "\n")

f.close()
```
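The correction logic can also be checked on its own. The function below is a hypothetical self-contained version of the three trimming loops above (`fix_offsets` and the example span are not part of the model):

```python
PUNCT_START = [" ", "(", ")", ",", ".", ";", ":", "!", "?", "-", "\n"]
PUNCT_END = [" ", ",", ".", ";", ":", "!", "?", "-", ")", "(", "\\", "/", "\"", "'",
             "+", "*", "&", "%", "$", "#", "@", "~", "`", "^", "|", "=", ">", "<", "]"]

def fix_offsets(text, start, end):
    """Re-anchor a predicted (start, end) span on clean word boundaries."""
    while text[start] in PUNCT_START:               # skip leading punctuation/whitespace
        start += 1
    while end < len(text) and text[end].isalnum():  # extend over a cut-off word
        end += 1
    while text[end - 1] in PUNCT_END:               # drop trailing punctuation
        end -= 1
    return start, end

# a span predicted as ".Villajoy" inside the text gets widened and trimmed:
text = "de Alicante.Villajoyosa, zona"
print(fix_offsets(text, 11, 20))  # → (12, 23), i.e. text[12:23] == "Villajoyosa"
```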