---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---

# ๐ŸŠ Korean Medical DPR(Dense Passage Retrieval)

## 1. Intro
A retrieval model with a Bi-Encoder architecture for the **medical domain**.  
It uses **SapBERT-KO-EN** as its base model in order to handle medical records written in mixed Korean and English.  
Questions are encoded with the Question Encoder, and passages with this Context Encoder.

- Question Encoder : [https://huggingface.co/snumin44/medical-biencoder-ko-bert-question](https://huggingface.co/snumin44/medical-biencoder-ko-bert-question)

(โ€ป ์ด ๋ชจ๋ธ์€ AI Hub์˜ [์ดˆ๊ฑฐ๋Œ€ AI ํ—ฌ์Šค์ผ€์–ด ์งˆ์˜ ์‘๋‹ต ๋ฐ์ดํ„ฐ](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71762)๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.)


## 2. Model

**(1) Self-Alignment Pretraining (SAP)**

Korean medical records are written in a **mix of Korean and English**, so the model needs to recognize English terminology as well.  
Using a Multi Similarity Loss, SapBERT-KO-EN was trained so that **terms sharing the same concept code** receive high similarity scores (a sketch of this objective follows the links below).
```
e.g.) C3843080 || ๊ณ ํ˜ˆ์•• ์งˆํ™˜
      C3843080 || Hypertension
      C3843080 || High Blood Pressure
      C3843080 || HTN
      C3843080 || HBP
```


- SapBERT-KO-EN : [https://huggingface.co/snumin44/sap-bert-ko-en](https://huggingface.co/snumin44/sap-bert-ko-en)
- Github : [https://github.com/snumin44/SapBERT-KO-EN](https://github.com/snumin44/SapBERT-KO-EN)
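
For intuition, here is a minimal PyTorch sketch of a Multi Similarity Loss over a batch of (concept code, term) embeddings. It is an illustrative assumption, not the actual SapBERT-KO-EN training code (see the GitHub repository above for that); the default `alpha`, `beta`, and `thresh` values merely echo the Scale Positive / Scale Negative / Threshold settings listed in Section 3.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, codes, alpha=1.0, beta=60.0, thresh=0.8):
    """Illustrative Multi Similarity Loss: terms that share a concept code
    (e.g. every C3843080 synonym) are pulled together; others are pushed apart."""
    emb = F.normalize(embeddings, dim=1)   # compare in cosine space
    sim = emb @ emb.T                      # (n, n) pairwise similarities
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (codes.unsqueeze(0) == codes.unsqueeze(1)) & ~eye  # same code
    neg = ~pos & ~eye                                        # different code
    pos_term = (1 / alpha) * torch.log1p((torch.exp(-alpha * (sim - thresh)) * pos).sum(1))
    neg_term = (1 / beta) * torch.log1p((torch.exp(beta * (sim - thresh)) * neg).sum(1))
    return (pos_term + neg_term).mean()

# Toy batch: three synonyms of one concept code, two of another.
codes = torch.tensor([0, 0, 0, 1, 1])
loss = multi_similarity_loss(torch.randn(5, 768), codes)
```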

**(2) Dense Passage Retrieval (DPR)**

Turning SapBERT-KO-EN into a retrieval model requires additional fine-tuning.  
It was fine-tuned as a DPR-style Bi-Encoder that scores the similarity between a query and a passage.  
The training set **augments the original dataset with mixed Korean-English samples**, as illustrated below.
```
e.g.) Korean disease name: ๊ณ ํ˜ˆ์••
      English disease name: Hypertension
      Query (original): ์•„๋ฒ„์ง€๊ฐ€ ๊ณ ํ˜ˆ์••์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. ๊ณ ํ˜ˆ์••์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.
      Query (augmented): ์•„๋ฒ„์ง€๊ฐ€ Hypertension์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. Hypertension์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.
```
(Both queries read: "My father has hypertension, but I don't know what that is. Please explain what hypertension is.")
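
In code, this augmentation amounts to substituting the English disease name into an existing Korean query. The helper below is hypothetical (the actual pipeline lives in the DPR-KO repository linked next), but it shows the transformation:

```python
def augment_query(query: str, ko_name: str, en_name: str) -> str:
    # Hypothetical helper: replace the Korean disease name with its English one.
    return query.replace(ko_name, en_name)

original = '์•„๋ฒ„์ง€๊ฐ€ ๊ณ ํ˜ˆ์••์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. ๊ณ ํ˜ˆ์••์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.'
print(augment_query(original, '๊ณ ํ˜ˆ์••', 'Hypertension'))
# ์•„๋ฒ„์ง€๊ฐ€ Hypertension์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. Hypertension์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.
```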

- Github : [https://github.com/snumin44/DPR-KO](https://github.com/snumin44/DPR-KO)
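
The DPR objective itself is compact enough to sketch. Below is the standard in-batch negative loss from the DPR paper: each question's paired passage is the positive, and every other passage in the batch acts as a negative. This is a generic sketch under those assumptions, not code taken from the DPR-KO repository.

```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb, c_emb):
    # q_emb: (batch, hidden) question embeddings
    # c_emb: (batch, hidden) embeddings of each question's positive passage
    scores = q_emb @ c_emb.T                # (batch, batch) dot-product scores
    labels = torch.arange(scores.size(0))   # positives lie on the diagonal
    return F.cross_entropy(scores, labels)  # softmax over in-batch candidates

# Toy example: a batch of 64 pairs from 768-dim encoders.
loss = dpr_in_batch_loss(torch.randn(64, 768), torch.randn(64, 768))
```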


## 3. Training

**(1) Self-Alignment Pretraining (SAP)**

The base model and hyperparameters used to train SapBERT-KO-EN are listed below.  
The training data was **KOSTOM**, a Korean standard medical terminology resource that lists Korean terms alongside their English equivalents.

- Model : klue/bert-base
- Dataset : **KOSTOM**
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60 

**(2) Dense Passage Retrieval (DPR)**

The base model and hyperparameters used for fine-tuning are as follows.

- Model : SapBERT-KO-EN(klue/bert-base)
- Dataset : **Ultra-Large AI Healthcare Q&A Dataset (AI Hub)**
- Epochs : 10
- Batch Size : 64
- Dropout : 0.1
- Pooler : 'cls' 


## 4. Example
์ด ๋ชจ๋ธ์€ Context๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๋Š” ๋ชจ๋ธ๋กœ, Question ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.       
๋™์ผํ•œ ์งˆ๋ณ‘์— ๊ด€ํ•œ ์งˆ๋ฌธ๊ณผ ํ…์ŠคํŠธ๊ฐ€ ๋†’์€ ์œ ์‚ฌ๋„๋ฅผ ๋ณด์ธ๋‹ค๋Š” ์‚ฌ์‹ค์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.     

(โ€ป ์•„๋ž˜ ์ฝ”๋“œ์˜ ์˜ˆ์‹œ๋Š” ChatGPT๋ฅผ ์ด์šฉํ•ด ์ƒ์„ฑํ•œ ์˜๋ฃŒ ํ…์ŠคํŠธ์ž…๋‹ˆ๋‹ค.)      
(โ€ป ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ ์ƒ ์˜ˆ์‹œ ๋ณด๋‹ค ์ •์ œ๋œ ํ…์ŠคํŠธ์— ๋Œ€ํ•ด ๋” ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.)

```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


query = 'high blood pressure ์ฒ˜๋ฐฉ ์‚ฌ๋ก€'

targets = [
    """๊ณ ํ˜ˆ์•• ์ง„๋‹จ.
    ํ™˜์ž ์ƒ๋‹ด ๋ฐ ์ƒํ™œ์Šต๊ด€ ๊ต์ • ๊ถŒ๊ณ . ์ €์—ผ์‹, ๊ทœ์น™์ ์ธ ์šด๋™, ๊ธˆ์—ฐ, ๊ธˆ์ฃผ ์ง€์‹œ.
    ํ™˜์ž ์žฌ๋ฐฉ๋ฌธ. ํ˜ˆ์••: 150/95mmHg. ์•ฝ๋ฌผ์น˜๋ฃŒ ์‹œ์ž‘. Amlodipine 5mg 1์ผ 1ํšŒ ์ฒ˜๋ฐฉ.""",
    
    """์‘๊ธ‰์‹ค ๋„์ฐฉ ํ›„ ์œ„ ๋‚ด์‹œ๊ฒฝ ์ง„ํ–‰.
    ์†Œ๊ฒฌ: Gastric ulcer์—์„œ Forrest IIb ๊ด€์ฐฐ๋จ. ์ถœํ˜ˆ์€ ์†Œ๋Ÿ‰์˜ ์‚ผ์ถœ์„ฑ ์ถœํ˜ˆ ํ˜•ํƒœ.
    ์ฒ˜์น˜: ์—ํ”ผ๋„คํ”„๋ฆฐ ์ฃผ์‚ฌ๋กœ ์ถœํ˜ˆ ๊ฐ์†Œ ํ™•์ธ. Hemoclip 2๊ฐœ๋กœ ์ถœํ˜ˆ ๋ถ€์œ„ ํด๋ฆฌํ•‘ํ•˜์—ฌ ์ง€ํ˜ˆ ์™„๋ฃŒ.""",
    
    """ํ˜ˆ์ค‘ ๋†’์€ ์ง€๋ฐฉ ์ˆ˜์น˜ ๋ฐ ์ง€๋ฐฉ๊ฐ„ ์†Œ๊ฒฌ.
    ๋‹ค๋ฐœ์„ฑ gallstones ํ™•์ธ. ์ฆ์ƒ ์—†์„ ๊ฒฝ์šฐ ๊ฒฝ๊ณผ ๊ด€์ฐฐ ๊ถŒ์žฅ.
    ์šฐ์ธก renal cyst, ์–‘์„ฑ ๊ฐ€๋Šฅ์„ฑ ๋†’์œผ๋ฉฐ ์ถ”๊ฐ€์ ์ธ ์ฒ˜์น˜ ๋ถˆํ•„์š” ํ•จ."""
]

# Encode the query with the Question encoder (CLS pooler output)
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    # Cosine similarity between two 1-D embedding vectors
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each passage with the Context encoder and score it against the query
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
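
When many passages must be scored, the per-passage loop above can be replaced by a single batched forward pass. The variant below reuses `c_model`, `c_tokenizer`, `targets`, and `query_embeddings` from the example; the padding and truncation settings are assumptions, not part of the original card.

```python
import torch

with torch.no_grad():  # inference only, no gradients needed
    feats = c_tokenizer(targets, padding=True, truncation=True, return_tensors='pt')
    ctx_emb = c_model(**feats).pooler_output                 # (num_targets, hidden)
    q_emb = torch.from_numpy(query_embeddings).unsqueeze(0)  # (1, hidden)
    sims = torch.nn.functional.cosine_similarity(q_emb, ctx_emb)

for idx, s in enumerate(sims.tolist()):
    print(f'Similarity between query and target {idx}: {s:.4f}')
```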


## Citing
```
@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    year={2021}
}
@inproceedings{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2020}
}
```