---
language:
- en
tags:
- llava
- phi
- HelpingAI
license: mit
library_name: transformers
base_model: visheratin/MC-LLaVA-3b
widget:
- text: What animal is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
- text: Where is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
---
# HelpingAI-Vision
<a target="_blank" href="https://colab.research.google.com/drive/1t2OAMVSKsiqVgvuHq7rhyNv28b67u0D8">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
## Model details
The core idea behind HelpingAI-Vision is to generate one token embedding for each of N parts (crops) of an image, rather than N visual token embeddings for the image as a whole. The model is based on Dolphin 2.6 Phi with a LLaVA adapter, and the per-crop approach aims to improve scene understanding by capturing more detailed information.
For every crop of the image, the full SigLIP encoder produces one embedding of size [1, 1152]. All N embeddings are then passed through the LLaVA adapter, yielding token embeddings of size [N, 2560]. These tokens currently carry no explicit information about their position in the original image; positional information is planned for a later update.
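The shape bookkeeping described above can be sketched in plain Python. This is only an illustration: the even grid split in `grid_crop_boxes` is a hypothetical cropping scheme (the source does not say how crops are chosen, and the real logic lives in `processing_llava.py`), and `embedding_shapes` tracks only the tensor shapes stated in the text, without running any encoder.

```python
from typing import List, Tuple

# (left, upper, right, lower), the box format PIL's Image.crop expects
Box = Tuple[int, int, int, int]

SIGLIP_DIM = 1152  # per-crop embedding size stated in the model details
LLM_DIM = 2560     # token embedding size after the adapter, stated in the model details


def grid_crop_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split an image into rows * cols non-overlapping boxes.

    Hypothetical scheme for illustration; the model's actual crop
    selection may differ.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def embedding_shapes(n_crops: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (encoder output shape, adapter output shape) for N crops."""
    return (n_crops, SIGLIP_DIM), (n_crops, LLM_DIM)


boxes = grid_crop_boxes(1024, 768, rows=2, cols=2)
encoder_shape, adapter_shape = embedding_shapes(len(boxes))
print(boxes)
print(encoder_shape, adapter_shape)
```

With a 2 x 2 grid there are N = 4 crops, so the language model receives four visual tokens, one per crop.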
HelpingAI-Vision was fine-tuned from Dolphin 2.6 Phi, leveraging the vision tower from SigLIP 400M. The training process had a context length of 1200 tokens, determined by the limitations of the L4 GPUs used.
The model uses the ChatML prompt format, which suits chat-style use:
```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
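A prompt in this format can be assembled with a small helper. The function below is a minimal sketch: `build_chatml_prompt` and its parameters are hypothetical names, not part of the model's code, and the `<image>` placeholder follows the example prompt used in the "How to use" section.

```python
def build_chatml_prompt(
    user_message: str,
    system_message: str = "You are Dolphin, a helpful AI assistant.",
    with_image: bool = False,
) -> str:
    """Build a single-turn ChatML prompt that ends at the assistant turn."""
    # The image placeholder goes on its own line before the user text.
    user_content = f"<image>\n{user_message}" if with_image else user_message
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("Describe the image.", with_image=True)
print(prompt)
```

Ending the string with the open assistant turn lets generation continue directly with the model's reply.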
## How to use
**Install dependencies**
```bash
!pip install -q open_clip_torch timm einops
```
**Download modeling files**
```python
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)
```
**Create a model**
```python
from modeling_llava import LlavaForConditionalGeneration
import torch
model = LlavaForConditionalGeneration.from_pretrained("OEvortex/HelpingAI-Vision", torch_dtype=torch.float16)
model = model.to("cuda")
```
**Create processors**
```python
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```
**Set image and text**
```python
from PIL import Image
import requests
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```
**Process inputs**
```python
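from transformers import TextStreamer

# Tokenize the prompt and preprocess the image without tracking gradients.
with torch.inference_mode():
    inputs = processor(prompt, raw_image, model, return_tensors='pt')

# Move the text inputs to the same device as the model.
inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer)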
```
**Generate the response**
```python
%%time
# %%time is a notebook cell magic; remove it when running as a plain script.
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9, temperature=1.2, eos_token_id=tokenizer.eos_token_id, streamer=streamer)
# Strip the prompt and the end-of-turn marker from the decoded text.
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
```
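Removing the prompt with `str.replace` only works when the decoded text reproduces the prompt character for character. As an alternative, a minimal sketch that splits on the ChatML markers is shown below; `extract_reply` is a hypothetical helper, and the decoded string in the example is made up for illustration rather than real model output.

```python
def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    # Keep only the text after the last assistant marker, if there is one.
    _, found, tail = decoded.rpartition(marker)
    reply = tail if found else decoded
    # Cut at the first end-of-turn token and trim surrounding whitespace.
    return reply.split("<|im_end|>", 1)[0].strip()


# Made-up decoded text, for illustration only.
decoded = (
    "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n"
    "<|im_start|>assistant\nA lake at sunset.<|im_end|>"
)
print(extract_reply(decoded))
```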