Update README.md
Browse files
README.md
CHANGED
@@ -57,6 +57,155 @@ This model was developed using English corpora, and thus may be considered Engli
|
|
57 |
Further, this model was developed in part using the [PMC-15M](https://aka.ms/biomedclip-paper) dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.
|
58 |
|
59 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
60 |
## Acknowledgement
|
61 |
|
62 |
- Our project is built upon [LLaVA](https://github.com/lm-sys/FastChat) and [Vicuna](https://github.com/lm-sys/FastChat): They provide our base models with the amazing multimodal and langauge capabilities, respectively!
|
|
|
57 |
Further, this model was developed in part using the [PMC-15M](https://aka.ms/biomedclip-paper) dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.
|
58 |
|
59 |
|
60 |
+
## Evaluation
|
61 |
+
|
62 |
+
### Medical Visual Chat (GPT-assisted Evaluation)
|
63 |
+
|
64 |
+
Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
|
65 |
+
|
66 |
+
1. Generate LLaVA-Med responses
|
67 |
+
|
68 |
+
```Shell
|
69 |
+
python model_vqa.py \
|
70 |
+
--model-name ./checkpoints/LLaVA-7B-v0 \
|
71 |
+
--question-file data/eval/llava_med_eval_qa50_qa.jsonl \
|
72 |
+
--image-folder data/images/ \
|
73 |
+
--answers-file /path/to/answer-file.jsonl
|
74 |
+
```
|
75 |
+
|
76 |
+
2. Evaluate the generated responses. In our case, [`llava_med_eval_qa50_qa.jsonl`](/data/eval/llava_med_eval_qa50_qa.jsonl) contains the questions, context (captions and inline-mentions) and responses generated by text-only GPT-4 (0314), which we treat as ground truth.
|
77 |
+
|
78 |
+
```Shell
|
79 |
+
python llava/eval/eval_multimodal_chat_gpt_score.py \
|
80 |
+
--question_input_path data/eval/llava_med_eval_qa50_qa.jsonl \
|
81 |
+
--input_path /path/to/answer-file.jsonl \
|
82 |
+
--output_path /path/to/save/gpt4-eval-for-individual-answers.jsonl
|
83 |
+
```
|
84 |
+
|
85 |
+
3. Summarize the evaluation results
|
86 |
+
|
87 |
+
```Shell
|
88 |
+
python summarize_gpt_review.py
|
89 |
+
```
|
90 |
+
|
91 |
+
### Medical VQA
|
92 |
+
|
93 |
+
Three Medical VQA datasets are considered in our experiments, including VQA-Rad, SLAKE, Pathology-VQA. We use VQA-Rad as the running example to illustrate how LLaVA-Med is applied to a downstream scenario.
|
94 |
+
|
95 |
+
#### - Prepare Data
|
96 |
+
1. Please see VQA-Rad [repo](https://paperswithcode.com/dataset/vqa-rad) for setting up the dataset.
|
97 |
+
2. Generate VQA-Rad dataset for LLaVA-Med conversation-style format (the same format with instruct tuning). For each dataset, we process it into three components: `train.json`, `test.json`, `images`.
|
98 |
+
|
99 |
+
|
100 |
+
#### - Fine-tuning
|
101 |
+
|
102 |
+
To achieve the higher performance for given a downstream dataset, the same full-model tuning script with instruct tuning is used to continue train LLaVA-Med.
|
103 |
+
|
104 |
+
<details>
|
105 |
+
<summary> Detailed script to fine-tune to downstream datasets: LLaVA-Med-7B, 8x A100 (40G). Time: ~1 hour.</summary>
|
106 |
+
|
107 |
+
```Shell
|
108 |
+
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
|
109 |
+
llava/train/train_mem.py \
|
110 |
+
--model_name_or_path /path/to/checkpoint_llava_med_instruct_60k_inline_mention \
|
111 |
+
--data_path /path/to/eval/vqa_rad/train.json \
|
112 |
+
--image_folder /path/to/eval/vqa_rad/images \
|
113 |
+
--vision_tower openai/clip-vit-large-patch14 \
|
114 |
+
--mm_vision_select_layer -2 \
|
115 |
+
--mm_use_im_start_end True \
|
116 |
+
--bf16 True \
|
117 |
+
--output_dir /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
|
118 |
+
--num_train_epochs 3 \
|
119 |
+
--per_device_train_batch_size 1 \
|
120 |
+
--per_device_eval_batch_size 4 \
|
121 |
+
--gradient_accumulation_steps 8 \
|
122 |
+
--evaluation_strategy "no" \
|
123 |
+
--save_strategy "steps" \
|
124 |
+
--save_steps 5000 \
|
125 |
+
--save_total_limit 3 \
|
126 |
+
--learning_rate 2e-5 \
|
127 |
+
--weight_decay 0. \
|
128 |
+
--warmup_ratio 0.03 \
|
129 |
+
--lr_scheduler_type "cosine" \
|
130 |
+
--logging_steps 1 \
|
131 |
+
--tf32 True \
|
132 |
+
--fsdp "full_shard auto_wrap" \
|
133 |
+
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
|
134 |
+
--model_max_length 2048 \
|
135 |
+
--gradient_checkpointing True \
|
136 |
+
--lazy_preprocess True \
|
137 |
+
--report_to wandb
|
138 |
+
```
|
139 |
+
</details>
|
140 |
+
|
141 |
+
## Serving
|
142 |
+
|
143 |
+
| Model Delta Weights | Size |
|
144 |
+
| --- | ---: |
|
145 |
+
| [llava_med_in_text_60k_delta.zip](https://hanoverprod.z21.web.core.windows.net/med_llava/models/llava_med_in_text_60k_delta.zip) | 11.06 GB |
|
146 |
+
|
147 |
+
The model weights above are *delta* weights. The usage of LLaVA-Med checkpoints should comply with the base LLM's model license: [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
|
148 |
+
|
149 |
+
Instructions:
|
150 |
+
|
151 |
+
1. Download the delta weights [llava_med_in_text_60k_delta.zip](https://hanoverprod.z21.web.core.windows.net/med_llava/models/llava_med_in_text_60k_delta.zip) and unzip.
|
152 |
+
1. Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
|
153 |
+
1. Use the following scripts to get LLaVA-Med weights by applying our delta. In the script below, set the --delta argument to the path of the unzipped `llava_med_in_text_60k_delta` directory. It can be adapted for other delta weights by changing the `--delta` argument (and base/target accordingly).
|
154 |
+
|
155 |
+
```bash
|
156 |
+
python3 -m llava.model.apply_delta \
|
157 |
+
--base /path/to/llama-7b \
|
158 |
+
--target /output/path/to/llava_med_in_text_60k \
|
159 |
+
--delta path/to/llava_med_in_text_60k_delta
|
160 |
+
```
|
161 |
+
|
162 |
+
|
163 |
+
## - Evaluation
|
164 |
+
|
165 |
+
Depending on which checkpoint is employed in evaluation, zero-shot performance is reported on medical instruct tuned checkpoint (eg, [LLaVA-Med-7B](/path/to/checkpoint_llava_med_instruct_60k_inline_mention)), and fine-tuned performance is reported on checkpoint that has been further tuned on training set of the downstream datasets (eg, [LLaVA-Med-7B-VQA-Rad](/path/to/checkpoint_llava_med_instruct_60k_inline_mention/fine_tuned/vqa_rad) ).
|
166 |
+
|
167 |
+
(a) Generate LLaVA responses on ScienceQA dataset
|
168 |
+
|
169 |
+
(a.1). [Option 1] Multiple-GPU inference
|
170 |
+
You may evaluate this with multiple GPUs, and concatenate the generated jsonl files. Please refer to our script for [batch evaluation](scripts/chunyl/finetune_on_benchmarks/eval_med_dataset_batch.sh).
|
171 |
+
|
172 |
+
```Shell
|
173 |
+
python llava/eval/run_med_datasets_eval_batch.py --num-chunks 8 --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
|
174 |
+
--question-file path/to/eval/vqa_rad/test.json \
|
175 |
+
--image-folder path/to/eval/vqa_rad/images \
|
176 |
+
--answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
|
177 |
+
```
|
178 |
+
(a.2). [Option 2] Single-GPU inference
|
179 |
+
|
180 |
+
```Shell
|
181 |
+
python llava/eval/model_vqa_med.py --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
|
182 |
+
--question-file path/to/eval/vqa_rad/test.json \
|
183 |
+
--image-folder path/to/eval/vqa_rad/images \
|
184 |
+
--answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
|
185 |
+
```
|
186 |
+
|
187 |
+
(b) Evaluate the generated responses
|
188 |
+
|
189 |
+
(b.1). [Option 1] Evaluation for all three VQA datasets
|
190 |
+
```Shell
|
191 |
+
|
192 |
+
python llava/eval/run_eval_batch.py \
|
193 |
+
--pred_file_parent_path /path/to/llava-med \
|
194 |
+
--target_test_type test-answer-file
|
195 |
+
```
|
196 |
+
|
197 |
+
It collects the decoding results of all predictions files under the project path, computes the corresponding evaluation metrics, and outputs the results in "`eval_results_med_datasets.jsonl`". To analyze the score, we provdie ipython notebook [run_eval_metrics.ipynb](llava/notebook/run_eval_metrics.ipynb).
|
198 |
+
|
199 |
+
(b.2). [Option 2] Evaluation for on one specific VQA dataset
|
200 |
+
```Shell
|
201 |
+
python llava/eval/run_eval.py \
|
202 |
+
--gt /path/to/eval/vqa_rad/test.json \
|
203 |
+
--pred /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
|
204 |
+
```
|
205 |
+
|
206 |
+
Please find the LLaVA-Med performance in [llava_med_performance.md](docs/llava_med_performance.md) or in the paper.
|
207 |
+
|
208 |
+
|
209 |
## Acknowledgement
|
210 |
|
211 |
- Our project is built upon [LLaVA](https://github.com/lm-sys/FastChat) and [Vicuna](https://github.com/lm-sys/FastChat): They provide our base models with the amazing multimodal and langauge capabilities, respectively!
|