---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---
# Korean Medical DPR (Dense Passage Retrieval)
## 1. Intro
A retrieval model with a Bi-Encoder architecture for the **medical domain**.
It uses **SapBERT-KO-EN** as its base model so that it can handle medical records written in mixed Korean and English.
Questions are encoded with the Question Encoder, and passages with the Context Encoder.

- Question Encoder : [https://huggingface.co/snumin44/medical-biencoder-ko-bert-question](https://huggingface.co/snumin44/medical-biencoder-ko-bert-question)

(※ This model was trained on the [Hyperscale AI Healthcare Question-Answering Data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71762) from AI Hub.)
## 2. Model
**(1) Self-Alignment Pretraining (SAP)**

Korean medical records are written in a **mix of Korean and English**, so the model needs to recognize English terminology as well.
Using Multi-Similarity Loss, SapBERT-KO-EN was trained so that **terms sharing the same concept code** obtain high similarity (a PyTorch sketch of this loss follows the example below).
```
ex) C3843080 || 고혈압 질환 (Korean: 'hypertensive disease')
    C3843080 || Hypertension
    C3843080 || High Blood Pressure
    C3843080 || HTN
    C3843080 || HBP
```
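The following is a minimal sketch of that objective, written for this card rather than taken from the released training code: a Multi-Similarity Loss over a batch of term embeddings, where terms sharing a concept code are positives and everything else is a negative. The `scale_pos`, `scale_neg`, and `threshold` defaults mirror the hyperparameters listed in Section 3; the hard-pair mining step of the original SAP paper is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, codes, scale_pos=1.0, scale_neg=60.0, threshold=0.8):
    """embeddings: (batch, dim) term embeddings; codes: (batch,) integer concept IDs.
    Pulls together terms with the same code, pushes apart the rest."""
    emb = F.normalize(embeddings, dim=-1)            # unit vectors: dot product = cosine
    sim = emb @ emb.T                                 # (batch, batch) similarity matrix
    same_code = codes.unsqueeze(0) == codes.unsqueeze(1)
    eye = torch.eye(len(codes), dtype=torch.bool, device=codes.device)
    pos_mask = same_code & ~eye                       # positive pairs, excluding self-pairs
    neg_mask = ~same_code                             # negative pairs

    pos = torch.log1p((pos_mask * torch.exp(-scale_pos * (sim - threshold))).sum(1)) / scale_pos
    neg = torch.log1p((neg_mask * torch.exp( scale_neg * (sim - threshold))).sum(1)) / scale_neg
    return (pos + neg).mean()
```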
- SapBERT-KO-EN : [https://huggingface.co/snumin44/sap-bert-ko-en](https://huggingface.co/snumin44/sap-bert-ko-en)
- Github : [https://github.com/snumin44/SapBERT-KO-EN](https://github.com/snumin44/SapBERT-KO-EN)
**(2) Dense Passage Retrieval (DPR)**

Additional fine-tuning is required to turn SapBERT-KO-EN into a retrieval model.
It was fine-tuned in the DPR fashion, using the Bi-Encoder to score the similarity between a query and a passage (a minimal sketch of the training objective follows this subsection).
The original dataset was **augmented with Korean-English code-switched samples**, as in the example below.
```
ex) Korean disease name  : 고혈압
    English disease name : Hypertension
    Query (original) : My father has 고혈압, but I don't know what that is. Please explain what 고혈압 is.
    Query (augmented): My father has Hypertension, but I don't know what that is. Please explain what Hypertension is.
```
- Github : [https://github.com/snumin44/DPR-KO](https://github.com/snumin44/DPR-KO)
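As a reference for the fine-tuning objective, here is a minimal sketch (an illustration for this card, not the repository's code) of DPR's in-batch-negative loss: the i-th question's positive passage sits on the diagonal of the question-passage score matrix, and every other passage in the batch serves as a negative.

```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb, c_emb):
    """q_emb, c_emb: (batch, dim) question/context embeddings, row i paired with row i."""
    scores = q_emb @ c_emb.T                              # (batch, batch) dot-product scores
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)                # NLL of the diagonal positives
```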
## 3. Training
**(1) Self-Alignment Pretraining (SAP)**

The base model and hyperparameters used to train SapBERT-KO-EN are listed below (see the embedding sketch after this list).
**KOSTOM**, a medical terminology dictionary that lists Korean and English medical terms side by side, served as the training data.
- Model : klue/bert-base
- Dataset : **KOSTOM**
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
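To make `Max Length : 64` and `Pooler : 'cls'` concrete, here is a hedged sketch of how terms are embedded with the base model; it is illustrative only, as the actual training loop lives in the SapBERT-KO-EN repository.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
model = AutoModel.from_pretrained('klue/bert-base')

terms = ['고혈압', 'Hypertension', 'HTN']  # terms sharing one concept code
batch = tokenizer(terms, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    out = model(**batch)

emb = out.last_hidden_state[:, 0]   # 'cls' pooling: take the [CLS] token embedding
emb = F.normalize(emb, dim=-1)      # unit vectors, so dot product = cosine similarity
print(emb @ emb.T)                  # pairwise similarities within the batch
```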
**(2) Dense Passage Retrieval (DPR)**

The base model and hyperparameters used for fine-tuning are as follows.
- Model : SapBERT-KO-EN (klue/bert-base)
- Dataset : **Hyperscale AI Healthcare Question-Answering Data (AI Hub)**
- Epochs : 10
- Batch Size : 64
- Dropout : 0.1
- Pooler : 'cls'
## 4. Example
This model encodes the context side of the Bi-Encoder and must be used together with the Question model.
The example below confirms that a question and a passage about the same disease receive a high similarity score.

(※ The medical texts in the code below were generated with ChatGPT.)
(※ Given the nature of the training data, the model works best on text that is more polished than these examples.)
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer
# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)
# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)
query = 'high blood pressure 처방 사례'  # mixed Korean-English query ('처방 사례' = 'prescription cases')

# Sample clinical notes (English renderings of the original Korean-English examples).
targets = [
    """Hypertension diagnosed.
Patient counseling and lifestyle changes recommended: low-sodium diet, regular exercise, smoking and alcohol cessation.
Follow-up visit. Blood pressure: 150/95 mmHg. Pharmacotherapy started: Amlodipine 5 mg once daily prescribed.""",

    """Gastroscopy performed after arrival at the emergency room.
Findings: Forrest IIb lesion observed in a gastric ulcer, with slow oozing-type bleeding.
Management: reduced bleeding confirmed after epinephrine injection; hemostasis completed by clipping the site with two hemoclips.""",

    """Elevated liver fat and fatty-liver findings.
Multiple gallstones confirmed; observation recommended if asymptomatic.
Right renal cyst, most likely benign; no further management required."""
]
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()
def cos_sim(A, B):
return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
for idx, target in enumerate(targets):
target_feature = c_tokenizer(target, return_tensors='pt')
target_outputs = c_model(**target_feature, return_dict=True)
target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
similarity = cos_sim(query_embeddings, target_embeddings)
print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
Expected output (similarities computed with the original Korean-English passages):
```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
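The loop above scores the query against each passage one by one. For retrieval over a real corpus, you would embed the passages once and search an index instead; below is a minimal sketch with FAISS (an assumption of this card: the `faiss-cpu` package is installed, it is not a dependency of the model), reusing `c_model`, `c_tokenizer`, `targets`, and `query_embeddings` from the example above.

```python
import faiss
import numpy as np

# Embed every passage once with the context encoder.
ctx_embeddings = []
for target in targets:
    feature = c_tokenizer(target, return_tensors='pt')
    outputs = c_model(**feature, return_dict=True)
    ctx_embeddings.append(outputs.pooler_output.detach().numpy().squeeze())
ctx_matrix = np.stack(ctx_embeddings).astype('float32')

# Exact inner-product index; swap in an IVF/HNSW index for large corpora.
index = faiss.IndexFlatIP(ctx_matrix.shape[1])
index.add(ctx_matrix)

scores, ids = index.search(query_embeddings.astype('float32').reshape(1, -1), k=3)
print(ids[0], scores[0])  # passage indices ranked by dot-product score
```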
## Citing
```
@inproceedings{liu2021self,
title={Self-Alignment Pretraining for Biomedical Entity Representations},
author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={4228--4238},
month = jun,
year={2021}
}
@inproceedings{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
```