---
license: llama3.2
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- food
- recipe
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the **food MLLM developed from Llama-3.2-11B** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
**(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
**(2) Training Pipeline**: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training; a rough sketch of this data mixing appears after the figure below.
**(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.

<p align='left'>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/bRu85CWwP9129bSCRzos2.png" width="1000">
</p>
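
To make point (2) concrete, here is a minimal sketch of single-stage data mixing: image-caption pairs are cast into the same conversational format as the synthesized visual instruction tasks and shuffled into one training set, rather than being consumed in two separate stages. This is an illustration under assumed file names and record schemas (including the `load_jsonl` helper and the caption-to-instruction template), not the paper's actual training code.

```python
import json
import random

def load_jsonl(path):
    # Hypothetical helper: one JSON record per line.
    with open(path) as f:
        return [json.loads(line) for line in f]

# Assumed inputs: domain image-caption pairs and synthesized visual instruction tasks.
caption_pairs = load_jsonl("food_image_captions.jsonl")          # {"image": ..., "caption": ...}
instruction_tasks = load_jsonl("food_visual_instructions.jsonl") # {"image": ..., "conversations": [...]}

# Cast each caption as an image-to-text task so every example shares one chat schema.
captions_as_tasks = [
    {
        "image": ex["image"],
        "conversations": [
            {"role": "user", "content": "Describe this image."},
            {"role": "assistant", "content": ex["caption"]},
        ],
    }
    for ex in caption_pairs
]

# Single-stage training: one shuffled mixture rather than
# stage 1 (captions) followed by stage 2 (instruction tasks).
train_set = captions_as_tasks + instruction_tasks
random.shuffle(train_set)
```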

## How to use

Starting from transformers >= 4.45.0, you can run inference using conversational messages that may include an image to query about.

Make sure to update your transformers installation via `pip install --upgrade transformers`.

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# This card's food MLLM, adapted from Llama-3.2-11B-Vision-Instruct.
model_id = "AdaptLLM/food-Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and spread it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch an example image to query about.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# A conversation that interleaves the image with a text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```
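
Since this checkpoint is food-adapted, a more representative query asks about a food image. The snippet below is a hypothetical variant of the example above, reusing the already-loaded `model` and `processor`; the image URL is a placeholder you would replace with a real food photo.

```python
# Hypothetical food-domain query (placeholder URL; substitute any food image).
url = "https://example.com/dish.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What dish is this, and what are its main ingredients?"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```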

## Citation
If you find our work helpful, please cite us.

AdaMLLM
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[AdaptLLM](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```