🍊 Korean Medical DPR (Dense Passage Retrieval)

1. Intro

This is a retrieval model with a Bi-Encoder architecture for use in the medical domain.
It uses SapBERT-KO-EN as its base model in order to handle medical records written in a mix of Korean and English.
Questions are encoded with the Question Encoder, and texts are encoded with the Context Encoder.

(※ This model was trained on the Hyperscale AI Healthcare Question Answering data from AI Hub.)

2. Model

(1) Self Alignment Pretraining (SAP)

Korean medical records are written in a mix of Korean and English, so the model must also recognize English terms.
The model was trained with Multi Similarity Loss so that terms sharing the same code have high similarity to one another.

e.g. C3843080 || 고혈압 질환 (Korean for "hypertensive disease")
     C3843080 || Hypertension
     C3843080 || High Blood Pressure
     C3843080 || HTN
     C3843080 || HBP
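
As a rough illustration of this objective, the sketch below computes a Multi-Similarity loss over a batch of term embeddings grouped by code. It is a simplified NumPy version that omits the pair-mining step of the original loss, and it is not the training code used for this model. The function name and the mapping of its arguments (`alpha`, `beta`, `base`) to the "Scale Positive Sample", "Scale Negative Sample" and "Threshold" settings listed in the Training section are assumptions.

```python
import numpy as np

def multi_similarity_loss(embeddings, codes, alpha=1.0, beta=60.0, base=0.8):
    """Simplified Multi-Similarity loss (no pair mining); argument mapping is assumed."""
    emb = np.asarray(embeddings, dtype=float)
    codes = np.asarray(codes)

    # Cosine similarity between every pair of L2-normalised embeddings.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Positives share a code (excluding self-pairs); negatives have different codes.
    same_code = codes[:, None] == codes[None, :]
    pos_mask = same_code & ~np.eye(len(codes), dtype=bool)
    neg_mask = ~same_code

    # Positives are penalised when their similarity falls below `base`,
    # negatives when their similarity rises above it.
    pos_term = np.log1p(np.sum(np.exp(-alpha * (sim - base)) * pos_mask, axis=1)) / alpha
    neg_term = np.log1p(np.sum(np.exp(beta * (sim - base)) * neg_mask, axis=1)) / beta
    return float(np.mean(pos_term + neg_term))

# Toy 2-D "term embeddings": rows 0-1 point one way, rows 2-3 another.
vectors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
well_aligned = multi_similarity_loss(vectors, ["C1", "C1", "C2", "C2"])
badly_aligned = multi_similarity_loss(vectors, ["C1", "C2", "C1", "C2"])
print(well_aligned < badly_aligned)  # True
```

The loss is lower when terms with the same code sit close together, which is the behaviour the pretraining step aims for.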

(2) Dense Passage Retrieval (DPR)

Turning SapBERT-KO-EN into a retrieval model requires additional fine-tuning.
It was fine-tuned with the DPR approach, which uses a Bi-Encoder architecture to compute the similarity between a query and a text.
The training set was built by augmenting the existing dataset with mixed Korean-English samples, as in the following example.

e.g. Korean disease name: 고혈압
     English disease name: Hypertension
     Query (original): 아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘.
     Query (augmented): 아버지가 Hypertension 인데 그게 뭔지 모르겠어. Hypertension 이 뭔지 설명좀 해줘.
     (Translation: "My father has hypertension, but I don't know what that is. Please explain what hypertension is.")
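
The augmentation in this example can be approximated with a simple term swap. The sketch below is an assumption about how such samples might be produced, not the authors' actual pipeline; `augment_query` is a hypothetical helper. It inserts a space when the disease name is directly followed by a Hangul particle, matching the spacing in the augmented query.

```python
import re

def augment_query(query: str, ko_name: str, en_name: str) -> str:
    """Replace a Korean disease name with its English name (hypothetical helper)."""
    pattern = re.escape(ko_name)
    # When a Hangul syllable (typically a particle) follows the name,
    # add a space between the English term and the particle.
    swapped = re.sub(pattern + r"(?=[가-힣])", en_name + " ", query)
    # Replace any remaining occurrences (e.g. before punctuation) as-is.
    return swapped.replace(ko_name, en_name)

original = "아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명좀 해줘."
augmented = augment_query(original, "고혈압", "Hypertension")
print(augmented)
```

A real pipeline would also need a dictionary of Korean-English disease name pairs; the source does not say which one was used for this step.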

3. Training

(1) Self Alignment Pretraining (SAP)

The base model and hyperparameters used to train SapBERT-KO-EN are listed below.
KOSTOM, a medical terminology dictionary containing Korean and English medical terms, was used as the training data.

  • Model : klue/bert-base
  • Dataset : KOSTOM
  • Epochs : 1
  • Batch Size : 64
  • Max Length : 64
  • Dropout : 0.1
  • Pooler : 'cls'
  • Eval Step : 100
  • Threshold : 0.8
  • Scale Positive Sample : 1
  • Scale Negative Sample : 60

(2) Dense Passage Retrieval (DPR)

The base model and hyperparameters used for fine-tuning are listed below.

  • Model : SapBERT-KO-EN(klue/bert-base)
  • Dataset : Hyperscale AI Healthcare Question Answering data (AI Hub)
  • Epochs : 10
  • Batch Size : 64
  • Dropout : 0.1
  • Pooler : 'cls'
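
The fine-tuning objective is not spelled out here. Assuming the standard DPR setup with in-batch negatives, where each question's paired text is the positive and the other texts in the batch serve as negatives, the loss can be sketched as follows. The function name and the toy embeddings are illustrative only.

```python
import numpy as np

def in_batch_negative_loss(q_emb, c_emb):
    """Negative log-likelihood of the matching context for each question.

    Row i of q_emb is assumed to pair with row i of c_emb; all other rows
    of c_emb act as negatives for question i.
    """
    scores = np.asarray(q_emb, dtype=float) @ np.asarray(c_emb, dtype=float).T
    # Numerically stable log-softmax over the contexts for each question.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct context for question i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

questions = 10.0 * np.eye(4)        # toy question embeddings
contexts = np.eye(4)                # toy context embeddings, correctly paired
matched = in_batch_negative_loss(questions, contexts)
mismatched = in_batch_negative_loss(questions, contexts[::-1])
print(matched < mismatched)  # True
```

With a batch size of 64, each question would see one positive and 63 in-batch negatives under this assumption.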

4. Example

This model encodes questions and must be used together with the Context model.
The example below shows that a question and a text about the same disease have high similarity.

(※ The example texts in the code below are medical texts generated with ChatGPT.)
(※ Owing to the nature of the training data, the model works better on text that is more refined than these examples.)

import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


query = 'high blood pressure 처방 사례'  # "high blood pressure prescription cases"

targets = [
    # Target 0: hypertension diagnosis, lifestyle advice, Amlodipine prescription
    """고혈압 진단.
    환자 상담 및 생활습관 교정 권고. 저염식, 규칙적인 운동, 금연, 금주 지시.
    환자 재방문. 혈압: 150/95mmHg. 약물치료 시작. Amlodipine 5mg 1일 1회 처방.""",

    # Target 1: emergency gastroscopy, bleeding gastric ulcer clipped with Hemoclips
    """응급실 도착 후 위 내시경 진행.
    소견: Gastric ulcer에서 Forrest IIb 관찰됨. 출혈은 소량의 삼출성 출혈 형태.
    처치: 에피네프린 주사로 출혈 감소 확인. Hemoclip 2개로 출혈 부위 클리핑하여 지혈 완료.""",

    # Target 2: high blood lipid levels, fatty liver, gallstones, renal cyst
    """혈중 높은 지방 수치 및 지방간 소견.
    다발성 gallstones 확인. 증상 없을 경우 경과 관찰 권장.
    우측 renal cyst, 양성 가능성 높으며 추가적인 처치 불필요 함."""
]

# Encode the query with the Question Encoder.
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

# Cosine similarity between two 1-D vectors.
def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each target with the Context Encoder and compare it with the query.
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")

Output:
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
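
When there are many candidate texts, the same comparison can be done as a single matrix operation over precomputed context embeddings. The sketch below ranks contexts against a query by cosine similarity; the function name and the toy vectors are illustrative and are not produced by the model.

```python
import numpy as np

def rank_by_cosine(query_vec, context_matrix):
    """Return context indices sorted from most to least similar, plus the scores."""
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(context_matrix, dtype=float)
    # Normalise so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)  # descending similarity
    return order, sims

# Toy 2-D vectors standing in for one query embedding and three context embeddings.
toy_query = np.array([1.0, 0.0])
toy_contexts = np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
order, sims = rank_by_cosine(toy_query, toy_contexts)
print(order.tolist())  # [1, 0, 2]
```

In practice `context_matrix` would hold the stacked Context Encoder outputs, computed once and reused for every incoming query.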

Citing

@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month = jun,
    year={2021}
}
@inproceedings{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}