gabrielandrade2 committed on
Commit
348db8b
1 Parent(s): e4e53ba

Update README, add example code

Files changed (3)
  1. README.md +109 -1
  2. example.py +42 -0
  3. requirements.txt +28 -0
README.md CHANGED
@@ -1,3 +1,111 @@
  ---
- license: mit
+ language: ja
+ license: gpl-3.0
  ---
+
+ A model used to estimate the start and end of a Named Entity (NE) span based on a point annotation, as used in the paper "Is boundary annotation necessary? Evaluating boundary-free approaches to improve clinical named entity annotation efficiency".
+
+ The goal of this model is to convert a point annotation into the corresponding span annotation with the correct boundaries.
+
+ The model locates an identifier token (⧫) and, based on its surrounding context, estimates where the NE concept starts and ends.
+
+ The model is trained to estimate the spans of disease and symptom names in Japanese medical texts.
+
+ ## Concepts
+
+ ### Point annotation
+
+ Unlike span-based paradigms, a point annotation consists of a single position within the NE span.
+ It is a simple and fast way to annotate NEs, but it introduces ambiguity in the information captured by the annotation.
+
+ In this repository's implementation, a point annotation is represented by a lozenge character (⧫).
+
+ Example:
+ ```
+ The patient has a history of dia⧫betes.
+ ```
+
+ ### Span annotation
+
+ A span annotation consists of two markings, identifying the start and end positions of the NE span.
+
+ The implementation in this repository is based on the span annotation schema defined by [Yada et al. (2020)](https://aclanthology.org/2020.lrec-1.561/).
+
+ Example:
+ ```
+ The patient has a history of <C>diabetes</C>.
+ ```
+
+ ## Model architecture
+
+ This model was fine-tuned on top of [cl-tohoku/bert-base-japanese-char-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2).
+
+ The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
+
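+ These values can be read off the model config, if you want to verify them (a quick sanity check using standard `transformers` config attributes):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("gabrielandrade2/point-to-span-estimation")
+ print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
+ # Expected: 12 768 12
+ ```
+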
+ To run, this model requires the following dependencies (installable with `pip install fugashi unidic-lite`):
+ - fugashi
+ - unidic-lite
+
+ ## Training data
+
+ The model was fine-tuned on a dataset of Japanese medical texts (not publicly available), comprising 1027 synthetic medication history notes generated through crowd-sourcing.
+
+ Ten experienced dispensing pharmacists were hired as writers to craft the corpus. Each writer was assigned one of 285 drug names and tasked with creating a "typical" clinical narrative. This corpus was later fully annotated for symptom and disease names.
+
+ Each annotation received a ⧫ token at a position within its span sampled from a truncated normal distribution, as sketched below.
+
+ The model was then trained to identify this token and output a span corresponding to the surrounding concept.
+
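+ A minimal sketch of that placement step (the distribution parameters and the `sample_point` helper are illustrative assumptions, not the paper's exact settings; it also requires `scipy`, which is not in `requirements.txt`):
+
+ ```python
+ from scipy.stats import truncnorm
+
+ def sample_point(start, end, scale_frac=0.25):
+     """Sample a character index in [start, end) from a truncated normal
+     centered on the span midpoint (hypothetical parameters)."""
+     mean = (start + end) / 2
+     scale = max((end - start) * scale_frac, 1e-6)
+     a, b = (start - mean) / scale, (end - 1 - mean) / scale
+     return int(round(truncnorm.rvs(a, b, loc=mean, scale=scale)))
+
+ text = "The patient has a history of diabetes."
+ start, end = 29, 37  # character span of "diabetes"
+ idx = sample_point(start, end)
+ print(text[:idx] + "⧫" + text[idx:])  # e.g. "The patient has a history of dia⧫betes."
+ ```
+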
+ ## Usage
+
+ The `requirements.txt` file contains all the dependencies needed to run the example code:
+
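+ ```
+ pip install -r requirements.txt
+ ```
+
+ The example below loads the model, runs prediction on point-annotated text, and converts the result to XML tags:
+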
+ ```python
+ import mojimoji
+ import numpy as np
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ import iob_util  # pip install git+https://github.com/gabrielandrade2/IOB-util.git
+
+ model_name = "gabrielandrade2/point-to-span-estimation"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+ # Point-annotated text
+ text = "肥大型心⧫筋症、心房⧫細動に対してWF投与が開始となった。\
+ 治療経過中に非持続性心⧫室頻拍が認められたためアミオダロンが併用となった。"
+
+ # Convert to zenkaku (full-width) characters and encode
+ text = mojimoji.han_to_zen(text)
+ input_ids = tokenizer.encode(text, return_tensors="pt")
+
+ # Predict spans
+ output = model(input_ids)
+ logits = output[0].detach().cpu().numpy()
+ tags = np.argmax(logits, axis=2)[0].tolist()
+
+ # Map label ids to IOB tags
+ id2label = model.config.id2label
+ tags = [id2label[t] for t in tags]
+
+ # Convert input ids back to characters
+ tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
+
+ # Remove model special tokens (CLS, SEP, PAD)
+ tags = [tag for token, tag in zip(tokens, tags) if token not in ['[CLS]', '[SEP]', '[PAD]']]
+ tokens = [token for token in tokens if token not in ['[CLS]', '[SEP]', '[PAD]']]
+
+ # Convert from IOB to XML tag format and strip the point markers
+ xml_text = iob_util.convert_iob_to_xml(tokens, tags)
+ xml_text = xml_text.replace('⧫', '')
+ print(xml_text)
+ ```
+
+ ### Output
+ ```xml
+ <C>肥大型心筋症</C>、<C>心房細動</C>に対してWF投与が開始となった。治療経過中に<C>非持続性心室頻拍</C>が認められたためアミオダロンが併用となった。
+ ```
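+
+ For intuition, the intermediate character-level IOB tags for the first entity look roughly like this (the label names assume the O/B-C/I-C scheme implied by the `<C>` output tag, and the ⧫ marker is tagged as part of the span before being stripped):
+
+ ```
+ 肥  B-C
+ 大  I-C
+ 型  I-C
+ 心  I-C
+ ⧫   I-C
+ 筋  I-C
+ 症  I-C
+ 、  O
+ ```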
example.py ADDED
@@ -0,0 +1,42 @@
+ import mojimoji
+ import numpy as np
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ import iob_util  # pip install git+https://github.com/gabrielandrade2/IOB-util.git
+
+ model_name = "gabrielandrade2/point-to-span-estimation"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+ # Point-annotated text
+ text = "肥大型心⧫筋症、心房⧫細動に対してWF投与が開始となった。\
+ 治療経過中に非持続性心⧫室頻拍が認められたためアミオダロンが併用となった。"
+
+ # Convert to zenkaku (full-width) characters and encode
+ text = mojimoji.han_to_zen(text)
+ input_ids = tokenizer.encode(text, return_tensors="pt")
+
+ # Predict spans
+ output = model(input_ids)
+ logits = output[0].detach().cpu().numpy()
+ tags = np.argmax(logits, axis=2)[0].tolist()
+
+ # Map label ids to IOB tags
+ id2label = model.config.id2label
+ tags = [id2label[t] for t in tags]
+
+ # Convert input ids back to characters
+ tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
+
+ # Remove model special tokens (CLS, SEP, PAD)
+ tags = [tag for token, tag in zip(tokens, tags) if token not in ['[CLS]', '[SEP]', '[PAD]']]
+ tokens = [token for token in tokens if token not in ['[CLS]', '[SEP]', '[PAD]']]
+
+ # Convert from IOB to XML tag format and strip the point markers
+ xml_text = iob_util.convert_iob_to_xml(tokens, tags)
+ xml_text = xml_text.replace('⧫', '')
+ print(xml_text)
requirements.txt ADDED
@@ -0,0 +1,28 @@
+ certifi==2024.2.2
+ charset-normalizer==3.3.2
+ filelock==3.13.1
+ fsspec==2024.2.0
+ fugashi==1.3.0
+ huggingface-hub==0.20.3
+ idna==3.6
+ iob_util @ git+https://github.com/gabrielandrade2/IOB-util.git@b5d522aa50238a25cdda19f2cf6908833acd6d64
+ Jinja2==3.1.3
+ lxml==5.1.0
+ MarkupSafe==2.1.5
+ mojimoji==0.0.13
+ mpmath==1.3.0
+ networkx==3.2.1
+ numpy==1.26.4
+ packaging==23.2
+ PyYAML==6.0.1
+ regex==2023.12.25
+ requests==2.31.0
+ safetensors==0.4.2
+ sympy==1.12
+ tokenizers==0.15.2
+ torch==2.2.0
+ tqdm==4.66.2
+ transformers==4.38.1
+ typing_extensions==4.9.0
+ unidic-lite==1.0.8
+ urllib3==2.2.1