---
license: llama3.1
language:
- en
pipeline_tag: image-text-to-text
tags:
- text-generation-inference
---

# Dragonfly Model Card

**Note: Users are permitted to use this model in accordance with the Llama 3.1 Community License Agreement.**

## Model Details

Dragonfly is a multimodal visual-language model trained by instruction tuning on Llama 3.1.

- **Developed by:** [Together AI](https://www.together.ai/)
- **Model type:** An autoregressive visual-language model based on the transformer architecture
- **License:** [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Finetuned from model:** [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

### Model Sources

- **Repository:** https://github.com/togethercomputer/Dragonfly
- **Paper:** https://arxiv.org/abs/2406.00977

## Uses

The primary use of Dragonfly is research on large visual-language models.
It is primarily intended for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## How to Get Started with the Model

### 💿 Installation

Create a conda environment and install the necessary packages:
```bash
conda env create -f environment.yml
conda activate dragonfly_env
```

Install FlashAttention:
```bash
pip install flash-attn --no-build-isolation
```

Finally, install the repository in editable mode:
```bash
pip install --upgrade -e .
```
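
If the installation succeeded, a quick import check should pass. This is a minimal sanity check, assuming the editable install exposes the `dragonfly` package under that name:
```python
# Minimal post-install sanity check.
# Assumption: `pip install -e .` registered the `dragonfly` package.
import flash_attn
import dragonfly

print("flash-attn version:", flash_attn.__version__)
```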

### 🧠 Inference

Once the installation completes successfully, you can follow the steps below.

Question: What is so funny about this image?

![Monalisa Dog](monalisa_dog.jpg)

Load the necessary packages:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
```
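
Note that `random_seed` is imported above but never called in this walkthrough. If you later enable sampling, seeding first makes runs reproducible; a hedged sketch, assuming `random_seed` accepts an integer seed:
```python
# Optional: seed RNGs for reproducibility (assumption: random_seed takes an int).
# With greedy decoding (temperature = 0 below), this does not change the output.
random_seed(42)
```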

Instantiate the tokenizer, processor, and model:
```python
device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v1")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v1")
model = model.to(torch.bfloat16)
model = model.to(device)
```
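
As a side note, the two-step cast above first materializes the weights in full precision. Assuming `DragonflyForCausalLM.from_pretrained` supports the standard `transformers` `torch_dtype` keyword, you can load directly in bfloat16 instead:
```python
# Hedged alternative: load weights in bfloat16 in one step,
# assuming the standard `torch_dtype` keyword is supported here.
model = DragonflyForCausalLM.from_pretrained(
    "togethercomputer/Llama-3.1-8B-Dragonfly-v1",
    torch_dtype=torch.bfloat16,
).to(device)
```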

Now, let's load and process the image:
```python
image = Image.open("./test_images/monalisa_dog.jpg")  # the Mona Lisa dog image shown above
image = image.convert("RGB")
images = [image]
# images = [None]  # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
```
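
Writing the Llama 3.1 chat markup by hand is error-prone. Assuming the bundled tokenizer ships the standard Llama 3.1 chat template, you can build an equivalent prompt programmatically; note the template prepends `<|begin_of_text|>`, which you may need to strip if the processor adds its own BOS token:
```python
# Hedged alternative: derive the prompt from the tokenizer's chat template.
messages = [{"role": "user", "content": "What is so funny about this image?"}]
text_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header for generation
)
```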

Finally, let's generate a response from the model:
```python
temperature = 0

with torch.inference_mode():
    generation_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        eos_token_id=tokenizer.encode("<|eot_id|>"),
        do_sample=temperature > 0,
        temperature=temperature,
        use_cache=True,
    )

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
```
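
Since `generate` echoes the prompt tokens and `skip_special_tokens=False` keeps the chat markup, the decoded string contains more than the answer. A minimal sketch for extracting just the assistant's reply:
```python
# Keep only the text after the final assistant header, then drop the end token.
response = generation_text[0].split("<|start_header_id|>assistant<|end_header_id|>")[-1]
response = response.replace("<|eot_id|>", "").strip()
print(response)
```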

An example response:
```plaintext
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the
original painting. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humorous effect that is likely to elicit laughter<|eot_id|>
```

## Training Details

See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977).

## Evaluation

See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977).

## 🏆 Credits

We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:

- [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct): We utilized Llama 3.1 as our foundational language model.
- [CLIP](https://huggingface.co/openai/clip-vit-base-patch32): Our vision backbone is the CLIP model from OpenAI.
- Our codebase is built upon the following two codebases:
  - [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)
  - [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)

## 📚 BibTeX

```bibtex
@misc{chen2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model},
      author={Kezhen Chen and Rahul Thapa and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Model Card Authors

Rahul Thapa, Kezhen Chen, Rahul Chalamala

## Model Card Contact

Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)