---
license: cc-by-nc-4.0
pipeline_tag: text-generation
tags:
- medical
- small LM
- instruction-tuned
- usmle
- chain-of-thought
- synthetic data
---


# Meerkat-7B (Version 1.0)

<center><img src = "https://cdn-uploads.huggingface.co/production/uploads/5efbdc4ac3896117eab961a9/IH0nR9HxYwNvrJBjP2dYQ.png" width="200" height="200"></center>

🚀 Meerkat-7B-v1.0 is an instruction-tuned medical AI system and the first 7B-parameter model to surpass the 60% passing threshold of the United States Medical Licensing Examination (USMLE).
The model was trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. 
This equips the model with high-level medical reasoning capabilities required for solving complex medical problems.
For further insights into our model, please refer to our paper!

📄 **Paper**: [Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks](https://arxiv.org/abs/2404.00376) 


## Quick Start

The input query should always end with "ASSISTANT:" as shown below.
```
query = "USER: What should I do when I get cold? ASSISTANT:"
```
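
For single-turn use, the raw string can be assembled with a small helper. This is a minimal sketch: `build_query` is a hypothetical name, not part of the released code, and it only covers the single-turn format shown above.

```python
def build_query(question: str) -> str:
    """Wrap a single user question in the USER/ASSISTANT format shown above."""
    # The trailing "ASSISTANT:" cues the model to start its reply.
    return f"USER: {question.strip()} ASSISTANT:"


query = build_query("What should I do when I get cold?")
print(query)
```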

The model can be run with the tokenizer's [apply_chat_template](https://huggingface.co/docs/transformers/main/chat_templating) function as follows:
```python
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # cuda or cpu
checkpoint = "dmis-lab/meerkat-7b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # You can choose to use this when there's not enough GPU memory available.
)

# Multi-turn dialogue example
messages = [
    {"role": "system", "content": "You are a helpful doctor or healthcare professional. Guide the conversation to provide useful, complete, and scientifically-grounded answers to user questions. You have the option to compose a concise, single-turn conversation if the user's input is comprehensive to provide accurate answers. However, if essential details are missing, you should engage in a multi-turn dialogue, asking follow-up questions to gather a thorough medical history and records.\n\n"},
    {"role": "user", "content": "Hello, doctor. I'm really concerned about my 10-year-old son. We recently discovered a painless mass in his left testicle, so we brought him to the pediatrician."},
    {"role": "assistant", "content": "I understand your concern. Let's gather some more information. Has your son experienced any other symptoms along with the mass?"},
    {"role": "user", "content": "Other than the mass, my son hasn't shown any symptoms. He's been his usual self, playing and eating normally."}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```

## Prompt Details

To reproduce the results reported in our paper, use the same system messages that were used during model training, as detailed below.

### USMLE or Clinical Cases

When solving USMLE-style questions such as [MedQA](https://arxiv.org/abs/2009.13081) and [MedBullets](https://arxiv.org/abs/2402.18060), or dealing with complex clinical cases like the [JAMA Clinical Challenge](https://arxiv.org/abs/2402.18060), use the following system message:
```
messages = [
    {"role": "system", "content": "The following is a multiple-choice question about medical knowledge. Solve this in a step-by-step fashion, starting by summarizing the available information. Output a single option from the given options as the final answer. You are strongly required to follow the specified output format; conclude your response with the phrase \"the answer is ([option_id]) [answer_string]\".\n\n"},
    {"role": "user", "content": "Two weeks after undergoing an emergency cardiac catherization with stenting for unstable angina pectoris, a 61-year-old man has decreased urinary output and malaise. He has type 2 diabetes mellitus and osteoarthritis of the hips. Prior to admission, his medications were insulin and naproxen. He was also started on aspirin, clopidogrel, and metoprolol after the coronary intervention. His temperature is 38\u00b0C (100.4\u00b0F), pulse is 93/min, and blood pressure is 125/85 mm Hg. Examination shows mottled, reticulated purplish discoloration of the feet. Laboratory studies show:\nHemoglobin count 14 g/dL\nLeukocyte count 16,400/mm3\nSegmented neutrophils 56%\nEosinophils 11%\nLymphocytes 31%\nMonocytes 2%\nPlatelet count 260,000/mm3\nErythrocyte sedimentation rate 68 mm/h\nSerum\nUrea nitrogen 25 mg/dL\nCreatinine 4.2 mg/dL\nRenal biopsy shows intravascular spindle-shaped vacuoles. Which of the following is the most likely cause of this patient's symptoms?\" (A) Renal papillary necrosis (B) Cholesterol embolization (C) Eosinophilic granulomatosis with polyangiitis (D) Polyarteritis nodosa"},
]
```
The model first generates a reasoning path to solve the problem and then gives its predicted answer.
Because the response ends with the phrase "the answer is", the predicted answer is easy to extract and compare with the correct answer.
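
The extraction step could look like the sketch below. It assumes the response follows the output format requested in the system message; `extract_answer` and the sample response string are illustrative and not taken from our evaluation code.

```python
import re
from typing import Optional

# Matches the required closing phrase, e.g. "the answer is (B) Cholesterol embolization".
ANSWER_PATTERN = re.compile(r"the answer is \(([A-Z])\)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the option letter from the last "the answer is (X)" phrase, or None."""
    matches = ANSWER_PATTERN.findall(response)
    # Use the last match in case the phrase also appears inside the reasoning.
    return matches[-1].upper() if matches else None


response = "... Therefore, the answer is (B) Cholesterol embolization"  # made-up output
print(extract_answer(response))
```

Responses that do not contain the phrase return `None`, so they can be counted as incorrect or inspected separately.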

### Multiple-choice Exams

For other types of multiple-choice exams such as [MedMCQA](https://arxiv.org/abs/2203.14371) or [MMLU](https://arxiv.org/abs/2009.03300), use the following simple system message:
```
messages = [
    {"role": "system", "content": "Answer the multiple-choice question about medical knowledge.\n\n"},
    {"role": "user", "content": "In a Robertsonian translocation fusion occurs at the: (A) telomeres. (B) centromeres. (C) histones. (D) ends of the long arms."},
]
```
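
When questions and options are stored separately, the user message can be assembled in the same "(A) ... (B) ..." layout as the example above. This is a sketch under that assumption; `build_mcq_messages` is a hypothetical helper, not part of the released code.

```python
from string import ascii_uppercase
from typing import Dict, List

SYSTEM_MESSAGE = "Answer the multiple-choice question about medical knowledge.\n\n"


def build_mcq_messages(question: str, options: List[str]) -> List[Dict[str, str]]:
    """Build chat messages with options labelled (A), (B), ... after the question."""
    labelled = " ".join(
        f"({ascii_uppercase[i]}) {option}" for i, option in enumerate(options)
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": f"{question} {labelled}"},
    ]


messages = build_mcq_messages(
    "In a Robertsonian translocation fusion occurs at the:",
    ["telomeres.", "centromeres.", "histones.", "ends of the long arms."],
)
print(messages[1]["content"])
```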

### Other Use Cases
Our model was trained using the [AlpaCare](https://github.com/xzhang97666/alpacare) instruction dataset comprising 52K examples, to enhance its generalization capabilities across diverse user prompts. 
Feel free to design and test your prompts and to share your thoughts with us, whether the model exceeds expectations or falls short!

## Evaluation

We tested models on seven medical benchmarks: [MedQA](https://arxiv.org/abs/2009.13081), [USMLE sample test](https://www.usmle.org/prepare-your-exam), [Medbullets-4](https://arxiv.org/abs/2402.18060), [Medbullets-5](https://arxiv.org/abs/2402.18060), [MedMCQA](https://arxiv.org/abs/2203.14371), [MMLU-Medical](https://arxiv.org/abs/2009.03300), and [JAMA Clinical Challenge](https://arxiv.org/abs/2402.18060).

| **Model**                       | **Average** | **MedQA** | **USMLE** | **Medbullets-4** | **Medbullets-5** | **MedMCQA** | **MMLU-Medical** | **JAMA** |
|:--------------------------------|:-----------:|:---------:|:---------:|:----------------:|:----------------:|:-----------:|:----------------:|:--------:|
| GPT-4                           | 75.2        | 81.4      | 86.6      | 68.8             | 63.3             | 72.4        | 87.1             | 67.1     |
| GPT-3.5                         | 54.1        | 53.6      | 58.5      | 51.0             | 47.4             | 51.0        | 67.3             | 50.1     |
| MediTron-70B (Ensemble, 5 runs) | -           | 70.2      | -         | -                | -                | 66.0        | 78.0             |  -       |
|*Open-source (7B)*|
| MediTron-7B                     | 50.8        | 50.2      | 44.6      | 51.1             | 45.5             | 57.9        | 56.7             | 49.3     |
| BioMistral-7B                   | 54.4        | 54.3      | 51.4      | 52.3             | 48.7             | **61.1**    | 64.6             | 48.6     |
| Meerkat-7B                      | 62.4        | 70.6      | 70.3      | 58.7             | 52.9             | 60.6        | 70.5             | 53.1     |
| Meerkat-7B (Ensemble, 5 runs)   | **64.2**    | **74.3**  | **71.4**  | **61.0**         | **55.3**         | 60.7        | **72.4**         | **54.0** |
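
This card does not spell out how the five runs in the "Ensemble" rows are combined. One common choice is a majority vote over the answers extracted from each sampled run, and the sketch below assumes exactly that; `majority_vote` and the sample answers are illustrative only.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer, ignoring runs with no parsable answer."""
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None
    # On ties, Counter.most_common keeps the answer that appeared first.
    return Counter(valid).most_common(1)[0][0]


runs = ["B", "B", "D", None, "B"]  # made-up answers from five sampled runs
print(majority_vote(runs))
```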

Please note that the scores in MMLU-Medical were calculated based on the average accuracies across six medical-related subjects in the original MMLU benchmark, and each result for a single subject is presented below.

| **Model**                       | **Average** | **Clinical Knowledge** | **Medical Genetics** | **Anatomy** | **Professional Medicine** | **College Biology** | **College Medicine** |
|:--------------------------------|:-----------:|:--------------------:|:--------------------:|:-----------:|:-------------------------:|:-------------------:|:--------------------:|
| GPT-4                           | 87.1        | 86.4                 | 92.0                 | 80.0        | 93.8                      | 93.8                | 76.3                 |
| GPT-3.5                         | 67.3        | 68.7                 | 68.0                 | 60.7        | 69.9                      | 72.9                | 63.6                 |
| MediTron-70B (Ensemble, 5 runs) | 78.0        | 75.5                 | 85.9                 | 69.4        | 82.3                      | 86.7                | 68.0                 |
|*Open-source (7B)*|
| MediTron-7B                     | 56.7        | 57.7                 | 63.8                 | 56.9        | 56.0                      | 57.1                | 48.9                 |
| BioMistral-7B                   | 64.6        | 59.9                 | 64.0                 | 56.5        | 60.4                      | 59.0                | 54.7                 |
| Meerkat-7B                      | 70.5        | 71.6                 | 74.8                 | 63.2        | 77.3                      | 70.8                | **65.2**             |
| Meerkat-7B (Ensemble, 5 runs)   | **72.4**    | **74.1**             | **79.4**             | **64.1**    | **78.8**                  | **75.8**            | 62.4                 |
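
As a quick check of how the MMLU-Medical column relates to the per-subject scores, the snippet below averages the six values from the Meerkat-7B row above. It assumes an unweighted mean over subjects, with each subject counted equally regardless of its number of questions.

```python
# Per-subject accuracies copied from the Meerkat-7B row of the table above.
subject_scores = {
    "Clinical Knowledge": 71.6,
    "Medical Genetics": 74.8,
    "Anatomy": 63.2,
    "Professional Medicine": 77.3,
    "College Biology": 70.8,
    "College Medicine": 65.2,
}

# Unweighted (macro) average over the six subjects.
mmlu_medical = sum(subject_scores.values()) / len(subject_scores)
print(round(mmlu_medical, 1))  # 70.5, matching the Average column
```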

## Model Architecture

Our model is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), chosen for its accuracy and run-time efficiency.

## Training Data

We plan to release our training dataset publicly.

## Reference

Please see the information below to cite our paper.
```bibtex
@article{kim2024small,
  title={Small language models learn enhanced reasoning skills from medical textbooks},
  author={Kim, Hyunjae and Hwang, Hyeon and Lee, Jiwoo and Park, Sihyeon and Kim, Dain and Lee, Taewhoo and Yoon, Chanwoong and Sohn, Jiwoong and Choi, Donghee and Kang, Jaewoo},
  journal={arXiv preprint arXiv:2404.00376},
  year={2024}
}
```

## Contact

Feel free to email `hyunjae-kim@korea.ac.kr` if you have any questions.