license: mit
language:
  - ko
base_model:
  - klue/bert-base
pipeline_tag: feature-extraction
tags:
  - medical

🍊 Korean Medical DPR (Dense Passage Retrieval)

1. Intro

This is a Bi-Encoder retrieval model for the medical domain.
SapBERT-KO-EN serves as the base model so that medical records written in a mix of Korean and English can be processed.
Questions are encoded with the Question Encoder, and texts with the Context Encoder.

(Note: This model was trained on the Hyperscale AI Healthcare Question-Answering Data (초거대 AI 헬스케어 질의 응답 데이터) from AI Hub.)

2. Model

(1) Self Alignment Pretraining (SAP)

Korean medical records are written in a mix of Korean and English, so the model must also recognize English terms.
The model was trained with Multi Similarity Loss so that terms sharing the same code have high similarity to one another.

Ex) C3843080 || 고혈압 질환
    C3843080 || Hypertension
    C3843080 || High Blood Pressure
    C3843080 || HTN
    C3843080 || HBP
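The objective described above can be sketched in NumPy as follows. This is a minimal illustration of the Multi-Similarity loss without its hard-pair mining step, not the author's training code. The function name `multi_similarity_loss` is hypothetical, and reading the Threshold, Scale Positive Sample and Scale Negative Sample values from the Training section as the loss margin and the two scale factors is an assumption.

```python
import numpy as np


def multi_similarity_loss(embeddings, labels, threshold=0.8,
                          scale_pos=1.0, scale_neg=60.0):
    """Multi-Similarity loss over one batch (no hard-pair mining).

    `labels` holds one concept code per row of `embeddings`; rows that
    share a code are treated as positives. Default values are assumed
    to correspond to the hyperparameters listed under Training.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # L2-normalise so that the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    same_code = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)

    losses = []
    for i in range(len(labels)):
        pos = sim[i][same_code[i] & not_self[i]]
        neg = sim[i][~same_code[i]]
        # Positives cost more the further they fall below the threshold.
        pos_term = np.log1p(np.sum(np.exp(-scale_pos * (pos - threshold)))) / scale_pos
        # Negatives cost more the further they rise above the threshold.
        neg_term = np.log1p(np.sum(np.exp(scale_neg * (neg - threshold)))) / scale_neg
        losses.append(pos_term + neg_term)
    return float(np.mean(losses))
```

With this formulation, a batch in which synonyms of one code point in the same direction yields a lower loss than a batch in which they are mixed up with terms of another code.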

(2) Dense Passage Retrieval (DPR)

Turning SapBERT-KO-EN into a retrieval model requires additional fine-tuning.
It was fine-tuned in the DPR manner, which uses a Bi-Encoder structure to compute the similarity between a query and a text.
As shown below, the existing dataset was augmented with samples written in a mix of Korean and English.

Ex) Korean disease name: 고혈압
    English disease name: Hypertension
    Query (original): 아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘.
    Query (augmented): 아버지가 Hypertension 인데 그게 뭔지 모르겠어. Hypertension 이 뭔지 설명좀 해줘.
    Query (English gloss): My father has high blood pressure, but I don't know what that is. Please explain what high blood pressure is.
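The augmentation shown in this example can be approximated with a simple string substitution. The sketch below is only an illustration under that assumption. The helper name `augment_query` is hypothetical, and the actual augmentation pipeline is not described in this card.

```python
def augment_query(query, ko_name, en_name):
    """Return a Korean-English mixed copy of `query`, or None if the
    Korean disease name does not occur in it."""
    if ko_name not in query:
        return None
    # A space follows the English term, as in the example above, so the
    # Korean particle that comes next stays readable.
    return query.replace(ko_name, en_name + " ")


original = "아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘."
augmented = augment_query(original, "고혈압", "Hypertension")
print(augmented)
```

Keeping both the original and the augmented query in the training set exposes the encoder to the same question with the disease name in either language.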

3. Training

(1) Self Alignment Pretraining (SAP)

The base model and hyperparameters used to train SapBERT-KO-EN are listed below.
KOSTOM, a medical terminology dictionary that contains Korean and English medical terms, was used as the training data.

  • Model : klue/bert-base
  • Dataset : KOSTOM
  • Epochs : 1
  • Batch Size : 64
  • Max Length : 64
  • Dropout : 0.1
  • Pooler : 'cls'
  • Eval Step : 100
  • Threshold : 0.8
  • Scale Positive Sample : 1
  • Scale Negative Sample : 60

(2) Dense Passage Retrieval (DPR)

The base model and hyperparameters used for fine-tuning are listed below.

  • Model : SapBERT-KO-EN(klue/bert-base)
  • Dataset : Hyperscale AI Healthcare Question-Answering Data (초거대 AI 헬스케어 질의 응답 데이터, AI Hub)
  • Epochs : 10
  • Batch Size : 64
  • Dropout : 0.1
  • Pooler : 'cls'
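For orientation, the standard DPR objective from Karpukhin et al. treats the other passages in a batch as negatives for each question. Whether this model was trained with exactly that objective is not stated here, so the NumPy sketch below is an assumption about the training setup, and the function name `in_batch_negative_loss` is hypothetical.

```python
import numpy as np


def in_batch_negative_loss(question_vecs, context_vecs):
    """Mean negative log-likelihood of each question's own passage,
    using every other passage in the batch as a negative.

    Row i of `question_vecs` is assumed to be paired with row i of
    `context_vecs`.
    """
    question_vecs = np.asarray(question_vecs, dtype=float)
    context_vecs = np.asarray(context_vecs, dtype=float)

    # scores[i, j] = dot-product similarity of question i and passage j.
    scores = question_vecs @ context_vecs.T
    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The gold passage for question i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

Under this objective, a batch size of 64 would give each question one positive passage and 63 in-batch negatives.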

4. Example

This model encodes contexts and must be used together with the Question model.
The example shows that a question and a text about the same disease have high similarity.

(Note: The example texts in the code below are medical texts generated with ChatGPT.)
(Note: Because of the nature of the training data, the model works better on text that is cleaner than these examples.)

import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


query = 'high blood pressure ์ฒ˜๋ฐฉ ์‚ฌ๋ก€'

targets = [
    """๊ณ ํ˜ˆ์•• ์ง„๋‹จ.
    ํ™˜์ž ์ƒ๋‹ด ๋ฐ ์ƒํ™œ์Šต๊ด€ ๊ต์ • ๊ถŒ๊ณ . ์ €์—ผ์‹, ๊ทœ์น™์ ์ธ ์šด๋™, ๊ธˆ์—ฐ, ๊ธˆ์ฃผ ์ง€์‹œ.
    ํ™˜์ž ์žฌ๋ฐฉ๋ฌธ. ํ˜ˆ์••: 150/95mmHg. ์•ฝ๋ฌผ์น˜๋ฃŒ ์‹œ์ž‘. Amlodipine 5mg 1์ผ 1ํšŒ ์ฒ˜๋ฐฉ.""",
    
    """์‘๊ธ‰์‹ค ๋„์ฐฉ ํ›„ ์œ„ ๋‚ด์‹œ๊ฒฝ ์ง„ํ–‰.
    ์†Œ๊ฒฌ: Gastric ulcer์—์„œ Forrest IIb ๊ด€์ฐฐ๋จ. ์ถœํ˜ˆ์€ ์†Œ๋Ÿ‰์˜ ์‚ผ์ถœ์„ฑ ์ถœํ˜ˆ ํ˜•ํƒœ.
    ์ฒ˜์น˜: ์—ํ”ผ๋„คํ”„๋ฆฐ ์ฃผ์‚ฌ๋กœ ์ถœํ˜ˆ ๊ฐ์†Œ ํ™•์ธ. Hemoclip 2๊ฐœ๋กœ ์ถœํ˜ˆ ๋ถ€์œ„ ํด๋ฆฌํ•‘ํ•˜์—ฌ ์ง€ํ˜ˆ ์™„๋ฃŒ.""",
    
    """ํ˜ˆ์ค‘ ๋†’์€ ์ง€๋ฐฉ ์ˆ˜์น˜ ๋ฐ ์ง€๋ฐฉ๊ฐ„ ์†Œ๊ฒฌ.
    ๋‹ค๋ฐœ์„ฑ gallstones ํ™•์ธ. ์ฆ์ƒ ์—†์„ ๊ฒฝ์šฐ ๊ฒฝ๊ณผ ๊ด€์ฐฐ ๊ถŒ์žฅ.
    ์šฐ์ธก renal cyst, ์–‘์„ฑ ๊ฐ€๋Šฅ์„ฑ ๋†’์œผ๋ฉฐ ์ถ”๊ฐ€์ ์ธ ์ฒ˜์น˜ ๋ถˆํ•„์š” ํ•จ."""
]

query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")

Output:

Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
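When there are many candidate texts, the per-target loop can be replaced by one vectorised ranking step over precomputed embeddings. The sketch below is one possible way to do this. The function `rank_by_cosine` and its `top_k` parameter are hypothetical helpers, not part of this model's code.

```python
import numpy as np


def rank_by_cosine(query_vec, target_vecs, top_k=None):
    """Return (index, cosine similarity) pairs, most similar first.

    `query_vec` has shape (dim,) and `target_vecs` has shape
    (n_targets, dim), e.g. stacked pooler outputs.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    target_vecs = np.asarray(target_vecs, dtype=float)

    # Normalise both sides so the dot product is cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = t @ q

    order = np.argsort(-sims)  # descending similarity
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(sims[i])) for i in order]
```

Applied to the query embedding and a stack of the three target embeddings from the example, a call such as `rank_by_cosine(query_embeddings, stacked_targets, top_k=1)` would return the index and score of the best-matching text, where `stacked_targets` is an assumed `(3, dim)` array.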

Citing

@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month = jun,
    year={2021}
}
@inproceedings{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2020}
}