---
license: llama3.1
language:
- en
pipeline_tag: image-text-to-text
tags:
- text-generation-inference
---

# Dragonfly Model Card

**Note: Users are permitted to use this model in accordance with the Llama 3.1 Community License Agreement.**

## Model Details

Dragonfly is a multimodal visual-language model trained by instruction tuning on Llama 3.1.

- **Developed by:** [Together AI](https://www.together.ai/)
- **Model type:** An autoregressive visual-language model based on the transformer architecture
- **License:** [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Finetuned from model:** [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

### Model Sources

- **Repository:** https://github.com/togethercomputer/Dragonfly
- **Paper:** https://arxiv.org/abs/2406.00977

## Uses

The primary use of Dragonfly is research on large visual-language models.
It is primarily intended for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## How to Get Started with the Model

### 💿 Installation

Create a conda environment and install the necessary packages:
```bash
conda env create -f environment.yml
conda activate dragonfly_env
```

Install FlashAttention:
```bash
pip install flash-attn --no-build-isolation
```

Finally, install the repository in editable mode:
```bash
pip install --upgrade -e .
```
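
If the installation succeeded, a quick import check should pass. This is a minimal sanity check, assuming the editable install exposes the `dragonfly` package under that name:
```python
# Minimal post-install sanity check.
# Assumption: `pip install -e .` registered the `dragonfly` package.
import flash_attn
import dragonfly

print("flash-attn version:", flash_attn.__version__)
```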

### 🧠 Inference

Once the installation completes successfully, you can follow the steps below.

Question: What is so funny about this image?

![Monalisa Dog](monalisa_dog.jpg)

Load the necessary packages:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
```
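
Note that `random_seed` is imported above but never called in this walkthrough. If you later enable sampling, seeding first makes runs reproducible; a hedged sketch, assuming `random_seed` accepts an integer seed:
```python
# Optional: seed RNGs for reproducibility (assumption: random_seed takes an int).
# With greedy decoding (temperature = 0 below), this does not change the output.
random_seed(42)
```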

Instantiate the tokenizer, processor, and model:
```python
device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v1")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v1")
model = model.to(torch.bfloat16)
model = model.to(device)
```
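
As a side note, the two-step cast above first materializes the weights in full precision. Assuming `DragonflyForCausalLM.from_pretrained` supports the standard `transformers` `torch_dtype` keyword, you can load directly in bfloat16 instead:
```python
# Hedged alternative: load weights in bfloat16 in one step,
# assuming the standard `torch_dtype` keyword is supported here.
model = DragonflyForCausalLM.from_pretrained(
    "togethercomputer/Llama-3.1-8B-Dragonfly-v1",
    torch_dtype=torch.bfloat16,
).to(device)
```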

Now, let's load and process the image:
```python
image = Image.open("./test_images/monalisa_dog.jpg")  # the Mona Lisa dog image shown above
image = image.convert("RGB")
images = [image]
# images = [None]  # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
```
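
Writing the Llama 3.1 chat markup by hand is error-prone. Assuming the bundled tokenizer ships the standard Llama 3.1 chat template, you can build an equivalent prompt programmatically; note the template prepends `<|begin_of_text|>`, which you may need to strip if the processor adds its own BOS token:
```python
# Hedged alternative: derive the prompt from the tokenizer's chat template.
messages = [{"role": "user", "content": "What is so funny about this image?"}]
text_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header for generation
)
```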

Finally, let's generate a response from the model:
```python
temperature = 0

with torch.inference_mode():
    generation_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        eos_token_id=tokenizer.encode("<|eot_id|>"),
        do_sample=temperature > 0,
        temperature=temperature,
        use_cache=True,
    )

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
```
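
Since `generate` echoes the prompt tokens and `skip_special_tokens=False` keeps the chat markup, the decoded string contains more than the answer. A minimal sketch for extracting just the assistant's reply:
```python
# Keep only the text after the final assistant header, then drop the end token.
response = generation_text[0].split("<|start_header_id|>assistant<|end_header_id|>")[-1]
response = response.replace("<|eot_id|>", "").strip()
print(response)
```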

An example response:
```plaintext
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the
original painting. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humorous effect that is likely to elicit laughter<|eot_id|>
```

## Training Details

See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977).

## Evaluation

See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977).

## 🏆 Credits

We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:

- [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct): We utilized Llama 3.1 as our foundational language model.
- [CLIP](https://huggingface.co/openai/clip-vit-base-patch32): Our vision backbone is the CLIP model from OpenAI.
- Our codebase is built upon the following two codebases:
  - [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)
  - [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)

## 📚 BibTeX

```bibtex
@misc{chen2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model},
      author={Kezhen Chen and Rahul Thapa and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Model Card Authors

Rahul Thapa, Kezhen Chen, Rahul Chalamala

## Model Card Contact

Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)