AdaptLLM committed · Commit d286c25 (verified) · Parent: 7d50d47

Update README.md

Files changed (1): README.md (+94, -3)
---
license: llama3.2
datasets:
- AdaptLLM/medicine-visual-instructions
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- biology
- medical
- chemistry
---

# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the biomedicine MLLM developed from Llama-3.2-11B in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930). The corresponding training data is in [medicine-visual-instructions](https://huggingface.co/datasets/AdaptLLM/medicine-visual-instructions).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
**(2) Training Pipeline**: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.

<p align='center'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/-Jp7pAsCR2Tj4WwfwsbCo.png" width="600">
</p>

## How to use

Starting with `transformers >= 4.45.0`, you can run inference using conversational messages that may include an image to query about.

Make sure to update your transformers installation via `pip install --upgrade transformers`.

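To quickly confirm that your environment meets this requirement (a small check, not part of the original card):

```python
import transformers

# The Mllama classes used below require transformers >= 4.45.0.
print(transformers.__version__)
```
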
```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "AdaptLLM/medicine-Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and spread it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a conversational prompt that includes the image.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```
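
The prompt above mirrors the generic example from the base Llama-3.2-Vision model card. For a biomedicine-style query, here is a minimal sketch reusing the `model` and `processor` objects loaded above (the local image path and the question are placeholders, not from the original card):

```python
from PIL import Image

# Assumes `model` and `processor` are already loaded as in the snippet above.
image = Image.open("example_medical_image.png")  # placeholder: any local domain image

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the key findings visible in this image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```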

## Citation
If you find our work helpful, please cite us.

AdaMLLM
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[AdaptLLM](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
  adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```