---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- medical
datasets:
- Open-Orca/OpenOrca
- pubmed
- medmcqa
- maximegmd/medqa_alpaca_format
base_model: mistralai/Mistral-7B-v0.1
metrics:
- accuracy
---
<img width=30% src="assets/logo.png" alt="logo" title="logo">
# Model Card for Internist.ai 7b
Internist.ai 7b is a medical domain large language model trained by medical doctors to demonstrate the benefits of a **physician-in-the-loop** approach. The training data was carefully curated by medical doctors to ensure clinical relevance and the quality required for clinical practice.
**Internist.ai 7b is the first 7b model to score above the 60% pass threshold on MedQA (USMLE), and it outperforms models of similar size across most medical evaluations.**
This model serves as a proof of concept, and larger models trained on a larger corpus of medical literature are planned. Do not hesitate to reach out to us if you would like to sponsor some compute to speed up this training.
<details open>
<summary><strong>Advisory Notice</strong></summary>
<blockquote style="padding: 10px; margin: 0 0 10px; border-left: 5px solid #ddd;">
The model was designed by medical doctors for medical doctors and did not undergo specific training to address potential security issues when used by non-medical professionals.
We highly recommend against the use of this model in a live environment without extensive evaluation through prospective clinical trials and additional training to meet the required safety levels.
</blockquote>
</details>
## Model Details
- **Developed by:** [UCLouvain](https://uclouvain.be/) and [Cliniques Universitaires Saint-Luc](https://saintluc.be/)
- **Language(s):** English (mainly)
- **Model License:** [APACHE 2.0 LICENSE](LICENSE)
- **Code License:** [APACHE 2.0 LICENSE](LICENSE)
- **Continue-pretrained from model:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Context length:** 4096 tokens
- **Knowledge Cutoff:** October 2023
### Model Sources
- **Trainer:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
- **Paper:** Accepted, awaiting publication date (*[Impact of High-Quality, Mixed-Domain Data on the Performance of Medical Language Models](#)*)
## Uses
This model was trained to demonstrate the benefit of using high-quality and relevant medical literature, together with general data to retain capabilities in other domains. The model was therefore not trained for any specific use and did not benefit from additional instruction tuning to ensure safety.
The model in its current state can be useful for medical professionals as an assistant, be it for clinical decision support or documentation. We do not recommend the use of this model by non-professionals, who may not be able to notice errors.
We recommend additional task-specific training and safety evaluation before using the model in a real-world setting.
### Format
The model uses the Alpaca format, which is available as a chat template:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("internistai/base-7b-v0.2")
tokenizer = AutoTokenizer.from_pretrained("internistai/base-7b-v0.2")

messages = [
    {"role": "user", "content": "Describe the anatomy of nutcracker syndrome"},
]

# Render the conversation with the bundled chat template and tokenize it
encodeds = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Move both the inputs and the model to the target device
model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
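For readers unfamiliar with the Alpaca format, the sketch below builds a prompt by hand using the layout of the original Stanford Alpaca release. This is an assumption: the chat template bundled with the tokenizer is authoritative and may differ in wording or whitespace, and `build_alpaca_prompt` is a hypothetical helper that is not part of this repository.

```python
# Preamble used by the original Stanford Alpaca prompts (no-input variant).
# Assumption: this model's bundled chat template may use different text.
ALPACA_PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str) -> str:
    """Hypothetical helper: wrap a single instruction in Alpaca-style markers."""
    return (
        f"{ALPACA_PREAMBLE}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n"
    )


prompt = build_alpaca_prompt("Describe the anatomy of nutcracker syndrome")
print(prompt)
```

To inspect the exact string the model receives, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the result.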
### Out-of-Scope Use
We do not recommend using this model for natural language generation in a production environment, finetuned or otherwise.
## Professional Evaluation
We created a free-response evaluation dataset of 100 questions and prompted both the model and GPT-4 (as a comparison) with these questions. We then collected the prompt/answer pairs and presented them to 10 medical doctors of different specialties, with questions to be answered on a 7-point Likert scale (see the paper for more information).
<img width=800px src="assets/likert.png" alt="Likert scale" title="likert">
## Training Details
### Training Data
The training data for Internist.ai 7b contains a total of 2.3B tokens:
- [**General Domain**](https://huggingface.co/datasets/Open-Orca/OpenOrca): OpenOrca-GPT4 is a state-of-the-art general domain dataset generated from Flan prompts using GPT-4.
- **Medical Guidelines**: 11,332 articles from UpToDate were included, as well as domain-specific guidelines provided by physicians to cover the [USMLE Content Outline](https://www.usmle.org/sites/default/files/2021-08/USMLE_Content_Outline.pdf).
- **Medical Books**: 10,376 textbooks were sourced from PMC LitArch and our university library.
- **Synthetic Data**: We generated 400M tokens by prompting a larger model with instructions to transform and adapt extracts from the Medical Guidelines.
*Data Availability*: Because the datasets contain proprietary information, we will not release them publicly. As we show in the paper, the model trained exclusively on the synthetic dataset performs very poorly and was not up to our standards, so we decided not to release that dataset either.
<img src="assets/loss.png" alt="Loss" title="loss">
### Training Procedure
We used Axolotl to train on a server with 4 NVIDIA A100 80GB GPUs for a total of 450 GPU hours. We used FlashAttention, NEFTune and sample packing with the parameters described below.
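As a rough illustration of what NEFTune does with the NEFT alpha of 5 listed in the hyperparameters, the sketch below adds uniform noise to a toy embedding matrix, scaled by `alpha / sqrt(seq_len * hidden_dim)` as in the NEFTune paper. It is a plain-Python sketch, not Axolotl's implementation; the function names are hypothetical, and the hidden size of 4096 used in the last line is Mistral 7B's, stated here as an assumption rather than taken from this card.

```python
import math
import random


def neftune_scale(seq_len: int, hidden_dim: int, alpha: float = 5.0) -> float:
    """Largest absolute noise value NEFTune adds to any embedding entry."""
    return alpha / math.sqrt(seq_len * hidden_dim)


def add_neftune_noise(embeddings, alpha=5.0, seed=0):
    """Return a noisy copy of a (seq_len x hidden_dim) list-of-lists matrix."""
    rng = random.Random(seed)
    scale = neftune_scale(len(embeddings), len(embeddings[0]), alpha)
    # Each entry receives independent Uniform(-1, 1) noise times the scale
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


# Toy example: 16 tokens with 8-dimensional embeddings, all zeros
toy = [[0.0] * 8 for _ in range(16)]
noisy = add_neftune_noise(toy)
max_abs = max(abs(x) for row in noisy for x in row)
bound = neftune_scale(16, 8)
print(max_abs <= bound)

# Noise bound for a full 4096-token sequence, assuming hidden size 4096
print(neftune_scale(4096, 4096))
```

The noise is applied only during training; inference uses the unmodified embeddings.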
#### Training Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| bf16 | true |
| lr | 6e-6 |
| eps | 1e-5 |
| epochs | 4 |
| betas | \[0.9, 0.95\] |
| weight decay | 0.1 |
| Batch size | 192,000 tokens |
| seq length | 4096 |
| lr scheduler | cosine |
| min lr | 1e-8 |
| NEFT alpha | 5 |
| warmup iterations | 100 |
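To make the schedule concrete, the sketch below estimates the number of optimizer steps and evaluates a cosine learning-rate curve from the values above. Several parts are assumptions rather than details from this card: that all 2.3B tokens are seen in each of the 4 epochs, that warmup is linear, and that the cosine decays from the peak rate to the minimum rate. Axolotl's actual scheduler may differ, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 6e-6
MIN_LR = 1e-8
WARMUP_STEPS = 100
EPOCHS = 4
TOKENS_PER_BATCH = 192_000
TOKENS_PER_EPOCH = 2.3e9  # assumption: the full corpus is seen every epoch

# Estimated number of optimizer steps over the whole run
TOTAL_STEPS = round(TOKENS_PER_EPOCH * EPOCHS / TOKENS_PER_BATCH)


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print("estimated steps:", TOTAL_STEPS)
for step in (0, 99, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

Under these assumptions the run takes roughly 48,000 steps, so the 100 warmup iterations cover only a small fraction of training.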
## Evaluation
### Testing Data & Metrics
#### Testing Data
- [MedQA (USMLE) - 4 options](https://huggingface.co/datasets/bigbio/med_qa)
- [MedMCQA](https://huggingface.co/datasets/medmcqa)
- [PubMedQA](https://huggingface.co/datasets/bigbio/pubmed_qa)
- [MMLU](https://huggingface.co/datasets/hails/mmlu_no_train)
#### Metrics
- Accuracy: we ran standardized 0-shot benchmarks using [lm-evaluation-harness](https://github.com/maximegmd/lm-evaluation-harness/tree/big-refactor/lm_eval).
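The sketch below shows how accuracy on a multiple-choice benchmark can be computed from per-option scores. It assumes, as is common for 0-shot multiple-choice evaluation, that the predicted answer is the option with the highest log-likelihood; the harness fork used here may apply a different scoring or normalization. The scores and helper names are invented for illustration and are not evaluation outputs.

```python
def pick_answer(option_loglikelihoods):
    """Index of the option with the highest log-likelihood."""
    return max(range(len(option_loglikelihoods)), key=option_loglikelihoods.__getitem__)


def accuracy(predictions, gold):
    """Fraction of questions where the prediction matches the gold index."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical log-likelihoods for three 4-option questions
scores = [
    [-4.1, -1.2, -3.3, -5.0],
    [-2.0, -2.5, -0.7, -3.1],
    [-0.9, -1.5, -2.2, -1.1],
]
gold = [1, 2, 3]

preds = [pick_answer(s) for s in scores]
acc = accuracy(preds, gold)
print(preds, acc)
```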
### Results
We report benchmarks on MedQA (4 options), MedMCQA, PubMedQA and the medical subsets of MMLU for our model and for models of similar size. Internist.ai 7b is the first 7b model to reach the 60% USMLE pass threshold on the MedQA benchmark.
| Benchmark | Internist.ai 7b | PMC LLaMA 7b* | Mistral 7b | Meditron 7b** |
| ----------- | ------------- | ------------ | ---------- | ----------- |
| MedQA | **60.5** | 27.7 (44.7) | 48.7 | 52.0 |
| MedMCQA | 55.8 | 32.2 (51.4) | 45.7 | **59.2** |
| PubMedQA | **79.4** | 67.8 (74.6) | 75.8 | 74.4 |
| MMLU Professional Medicine | **76.1** | 19.5 | 65.8 | 26.6 |
| MMLU Clinical Knowledge | **70.6** | 23.8 | 61.1 | 35.5 |
| MMLU Anatomy | **65.9** | 18.5 | 52.6 | 42.6 |
| MMLU College Medicine | **63.0** | 23.7 | 55.5 | 28.9 |
| MMLU Medical Genetics | **71.0** | 32.0 | 68.0 | 46.0 |
\*: PMC LLaMA 7b performed poorly on the benchmark, likely due to a formatting mismatch and a lack of instruction tuning. We include in parentheses the results reported by the authors when available.
\*\*: Meditron 7b's MMLU results are reported for transparency but are inconsistent with the average of 54.2 reported in their paper. Do not hesitate to communicate the details for each category so we can update the table.
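To make the gap mentioned in the Meditron footnote concrete, the sketch below computes the unweighted mean of the five MMLU medical categories from the table for each model. The unweighted mean is an assumption for illustration: the average reported in the Meditron paper may cover different categories or weight them by question count.

```python
# MMLU scores copied from the results table, in row order:
# Professional Medicine, Clinical Knowledge, Anatomy,
# College Medicine, Medical Genetics
mmlu = {
    "Internist.ai 7b": [76.1, 70.6, 65.9, 63.0, 71.0],
    "PMC LLaMA 7b": [19.5, 23.8, 18.5, 23.7, 32.0],
    "Mistral 7b": [65.8, 61.1, 52.6, 55.5, 68.0],
    "Meditron 7b": [26.6, 35.5, 42.6, 28.9, 46.0],
}

# Unweighted mean over the five categories (assumption, see above)
means = {name: sum(vals) / len(vals) for name, vals in mmlu.items()}

for name, mean in means.items():
    print(f"{name}: {mean:.1f}")
```

For Meditron 7b this mean falls well below the 54.2 average cited in the footnote, which is the inconsistency the footnote refers to.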
## Citation
**BibTeX:**
If you use Internist.ai 7b, please cite us:
```
```