katielink committed
Commit
d1216da
1 Parent(s): 9b7f3a6

Update README.md

Files changed (1)
  1. README.md +149 -0
README.md CHANGED
@@ -57,6 +57,155 @@ This model was developed using English corpora, and thus may be considered Engli

Further, this model was developed in part using the [PMC-15M](https://aka.ms/biomedclip-paper) dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.

## Evaluation

### Medical Visual Chat (GPT-assisted Evaluation)

We provide our GPT-assisted evaluation pipeline for multimodal modeling to enable a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

1. Generate LLaVA-Med responses

```Shell
python model_vqa.py \
    --model-name ./checkpoints/LLaVA-7B-v0 \
    --question-file data/eval/llava_med_eval_qa50_qa.jsonl \
    --image-folder data/images/ \
    --answers-file /path/to/answer-file.jsonl
```

2. Evaluate the generated responses. In our case, [`llava_med_eval_qa50_qa.jsonl`](/data/eval/llava_med_eval_qa50_qa.jsonl) contains the questions, context (captions and inline mentions), and responses generated by text-only GPT-4 (0314), which we treat as ground truth.

```Shell
python llava/eval/eval_multimodal_chat_gpt_score.py \
    --question_input_path data/eval/llava_med_eval_qa50_qa.jsonl \
    --input_path /path/to/answer-file.jsonl \
    --output_path /path/to/save/gpt4-eval-for-individual-answers.jsonl
```

3. Summarize the evaluation results

```Shell
python summarize_gpt_review.py
```

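The last step aggregates the per-answer GPT-4 reviews into the relative scores reported in the paper. Below is a minimal sketch of that kind of aggregation, assuming each line of the per-answer file carries a pair of scores with the GPT-4 (ground-truth) answer's score first and LLaVA-Med's second; the `scores` field name is illustrative and not necessarily the exact schema used by `summarize_gpt_review.py`.

```python
import json

def summarize(path):
    """Aggregate per-answer GPT-4 review scores into a relative score (%).

    Assumes each JSONL record has a 'scores' field: [gpt4_score, model_score].
    This mirrors the idea behind summarize_gpt_review.py, not its exact schema.
    """
    gpt4_total, model_total = 0.0, 0.0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            gpt4_score, model_score = record["scores"]
            gpt4_total += gpt4_score
            model_total += model_score
    return 100.0 * model_total / gpt4_total

if __name__ == "__main__":
    score = summarize("/path/to/save/gpt4-eval-for-individual-answers.jsonl")
    print(f"relative score: {score:.1f}%")
```
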
### Medical VQA

Three medical VQA datasets are considered in our experiments: VQA-Rad, SLAKE, and Pathology-VQA. We use VQA-Rad as the running example to illustrate how LLaVA-Med is applied to a downstream scenario.

#### - Prepare Data
1. Please see the VQA-Rad [repo](https://paperswithcode.com/dataset/vqa-rad) for setting up the dataset.
2. Convert the VQA-Rad dataset into the LLaVA-Med conversation-style format (the same format used for instruction tuning); a rough sketch of this conversion is given after this list. For each dataset, we process it into three components: `train.json`, `test.json`, `images`.

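As a rough illustration of step 2, the sketch below converts VQA-Rad question-answer entries into LLaVA-style conversation records. The raw field names (`image_name`, `question`, `answer`) and the prompt layout are assumptions about the downloaded VQA-Rad JSON, not the project's official conversion script; adjust them to match your copy of the dataset.

```python
import json

def to_conversation_format(vqa_rad_entries):
    """Convert raw VQA-Rad QA entries into LLaVA-style conversation records.

    The raw field names (image_name, question, answer) are assumed; adapt them
    to the actual VQA-Rad release you downloaded.
    """
    records = []
    for i, entry in enumerate(vqa_rad_entries):
        records.append({
            "id": str(i),
            "image": entry["image_name"],
            "conversations": [
                {"from": "human", "value": "<image>\n" + entry["question"]},
                {"from": "gpt", "value": entry["answer"]},
            ],
        })
    return records

if __name__ == "__main__":
    with open("path/to/vqa_rad_raw.json") as f:   # hypothetical input path
        raw = json.load(f)
    with open("train.json", "w") as f:
        json.dump(to_conversation_format(raw), f, indent=2)
```
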
#### - Fine-tuning

To achieve higher performance on a given downstream dataset, the same full-model tuning script used for instruction tuning is used to continue training LLaVA-Med.

<details>
<summary> Detailed script to fine-tune on downstream datasets: LLaVA-Med-7B, 8x A100 (40G). Time: ~1 hour.</summary>

```Shell
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path /path/to/checkpoint_llava_med_instruct_60k_inline_mention \
    --data_path /path/to/eval/vqa_rad/train.json \
    --image_folder /path/to/eval/vqa_rad/images \
    --vision_tower openai/clip-vit-large-patch14 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
```
</details>

## Serving

| Model Delta Weights | Size |
| --- | ---: |
| [llava_med_in_text_60k_delta.zip](https://hanoverprod.z21.web.core.windows.net/med_llava/models/llava_med_in_text_60k_delta.zip) | 11.06 GB |

The model weights above are *delta* weights. The usage of LLaVA-Med checkpoints should comply with the base LLM's model license: [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).

Instructions:

1. Download the delta weights [llava_med_in_text_60k_delta.zip](https://hanoverprod.z21.web.core.windows.net/med_llava/models/llava_med_in_text_60k_delta.zip) and unzip.
1. Get the original LLaMA weights in the Hugging Face format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
1. Use the following script to get the LLaVA-Med weights by applying our delta. In the script below, set the `--delta` argument to the path of the unzipped `llava_med_in_text_60k_delta` directory. It can be adapted for other delta weights by changing the `--delta` argument (and base/target accordingly).

```bash
python3 -m llava.model.apply_delta \
    --base /path/to/llama-7b \
    --target /output/path/to/llava_med_in_text_60k \
    --delta path/to/llava_med_in_text_60k_delta
```

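Conceptually, applying the delta adds the delta checkpoint's parameter tensors to the base LLaMA weights and keeps LLaVA-Med-specific modules from the delta as-is. The sketch below shows this idea on plain PyTorch state dicts; it is a simplification of `llava.model.apply_delta` (which also handles sharded checkpoints, the tokenizer, and embedding-size mismatches), not a drop-in replacement, and the single-file paths are illustrative.

```python
import torch

def apply_delta(base_state_path, delta_state_path, target_state_path):
    """Conceptual sketch of delta application: target = base + delta."""
    base = torch.load(base_state_path, map_location="cpu")
    delta = torch.load(delta_state_path, map_location="cpu")

    target = {}
    for name, delta_param in delta.items():
        if name in base and base[name].shape == delta_param.shape:
            target[name] = base[name] + delta_param  # shared LLaMA parameters
        else:
            target[name] = delta_param  # LLaVA-Med-specific modules (e.g., projector)
    torch.save(target, target_state_path)

# Illustrative paths; real checkpoints are usually sharded across several files.
apply_delta("llama-7b/pytorch_model.bin",
            "llava_med_in_text_60k_delta/pytorch_model.bin",
            "llava_med_in_text_60k/pytorch_model.bin")
```
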
#### - Evaluation

Depending on which checkpoint is used for evaluation, zero-shot performance is reported for the medical instruction-tuned checkpoint (e.g., [LLaVA-Med-7B](/path/to/checkpoint_llava_med_instruct_60k_inline_mention)), and fine-tuned performance is reported for the checkpoint that has been further tuned on the training set of the downstream dataset (e.g., [LLaVA-Med-7B-VQA-Rad](/path/to/checkpoint_llava_med_instruct_60k_inline_mention/fine_tuned/vqa_rad)).

(a) Generate LLaVA-Med responses on the test set of the downstream dataset (e.g., VQA-Rad)

(a.1). [Option 1] Multiple-GPU inference
You may evaluate with multiple GPUs and then concatenate the generated jsonl files (a concatenation sketch follows the command below). Please refer to our script for [batch evaluation](scripts/chunyl/finetune_on_benchmarks/eval_med_dataset_batch.sh).

```Shell
python llava/eval/run_med_datasets_eval_batch.py --num-chunks 8 --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --question-file path/to/eval/vqa_rad/test.json \
    --image-folder path/to/eval/vqa_rad/images \
    --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
```
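
If you run the chunks separately, the per-chunk answer files can simply be concatenated in chunk order. The `test-answer-file-chunk*.jsonl` naming pattern below is an assumption for illustration; match it to whatever names your batch run actually produces.

```python
import glob

# Concatenate per-chunk answer files (illustrative naming) into a single jsonl file.
chunk_files = sorted(glob.glob("test-answer-file-chunk*.jsonl"))
with open("test-answer-file.jsonl", "w") as out:
    for path in chunk_files:
        with open(path) as f:
            out.write(f.read())
```
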
(a.2). [Option 2] Single-GPU inference

```Shell
python llava/eval/model_vqa_med.py --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --question-file path/to/eval/vqa_rad/test.json \
    --image-folder path/to/eval/vqa_rad/images \
    --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
```

(b) Evaluate the generated responses

(b.1). [Option 1] Evaluation for all three VQA datasets

```Shell
python llava/eval/run_eval_batch.py \
    --pred_file_parent_path /path/to/llava-med \
    --target_test_type test-answer-file
```

This collects the decoding results of all prediction files under the project path, computes the corresponding evaluation metrics, and outputs the results to `eval_results_med_datasets.jsonl`. To analyze the scores, we provide the IPython notebook [run_eval_metrics.ipynb](llava/notebook/run_eval_metrics.ipynb); a quick way to inspect the file directly is sketched below.

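If you just want to peek at the aggregated results outside the notebook, the snippet below loads the jsonl file into a DataFrame. The exact columns depend on `run_eval_batch.py`, so treat this as a convenience sketch rather than part of the pipeline.

```python
import json
import pandas as pd

# Load the aggregated metrics produced by run_eval_batch.py; column names
# depend on that script, so this only prints whatever fields are present.
with open("eval_results_med_datasets.jsonl") as f:
    results = pd.DataFrame([json.loads(line) for line in f])
print(results)
```
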
(b.2). [Option 2] Evaluation on one specific VQA dataset

```Shell
python llava/eval/run_eval.py \
    --gt /path/to/eval/vqa_rad/test.json \
    --pred /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
```

Please find the LLaVA-Med performance in [llava_med_performance.md](docs/llava_med_performance.md) or in the paper.

## Acknowledgement

- Our project is built upon [LLaVA](https://github.com/lm-sys/FastChat) and [Vicuna](https://github.com/lm-sys/FastChat): They provide our base models with the amazing multimodal and language capabilities, respectively!