---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
license: mit
library_name: transformers
widget:
- text: "What animal is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "Where is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# Multi-crop LLaVA-3b

<a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Model details

Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with the text embeddings and send to the LLM. But what if, instead of creating N tokens
for one image, we create K << N tokens for each of M < N parts of the image (crops)? This lets us capture visual information from small parts of the image without inflating the
number of image "tokens" too much. I call this method multi-crop LLaVA (MC-LLaVA).
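
To make the idea concrete, here is a minimal sketch of the cropping step, assuming a simple non-overlapping grid. The helper name `make_crops` and the token counts in the comments are illustrative and are not the model's actual preprocessing code.

```python
# Toy illustration of multi-crop: encode M small crops with K tokens each
# instead of N tokens for the whole image, where K << N.
from PIL import Image


def make_crops(image: Image.Image, grid: int = 3) -> list[Image.Image]:
    """Split an image into a non-overlapping grid x grid set of crops."""
    width, height = image.size
    crop_w, crop_h = width // grid, height // grid
    return [
        image.crop((col * crop_w, row * crop_h, (col + 1) * crop_w, (row + 1) * crop_h))
        for row in range(grid)
        for col in range(grid)
    ]


# With hypothetical numbers (K = 32 tokens per crop, M = 9 crops), the LLM sees
# 9 * 32 = 288 image tokens that each describe a small region up close, instead of
# a single set of N tokens that has to summarize the whole picture at once.
```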

You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).

MC-LLaVA-3b was fine-tuned from a [Phi-2 merge](https://huggingface.co/vince62s/phi-2-psy) using the vision tower from
[SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).

Like Dolphin 2.6 Phi, MC-LLaVA-3b uses the ChatML prompt format:

```
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

```
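
For convenience, the template can also be filled in programmatically. The tiny helper below is only a sketch of the format above; the function name `build_prompt` is not part of the model's API.

```python
def build_prompt(question: str) -> str:
    # Wrap a user question in the ChatML template shown above.
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"


print(build_prompt("What animal is it?"))
```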

## How to use

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")

processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)

# Load an example image and build a ChatML prompt following the format above.
raw_image = Image.open(requests.get(
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg", stream=True
).raw)
prompt = "<|im_start|>user\nWhat animal is it?<|im_end|>\n<|im_start|>assistant\n"

with torch.inference_mode():
    inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
    output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=processor.tokenizer.eos_token_id)

# Strip the prompt and the end-of-turn marker from the decoded output.
result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
print(result)
```

## Benchmarks

- TextVQA: 50.9%
- GQA: 59.5%
- VQAv2: 76.72%
- VizWiz: 32.68%
- V*-bench: OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%

## Examples

<a target="_blank" href="https://colab.research.google.com/drive/1sXDvVl5s9fTcE0N2bQGOlXhnNlKEdeun">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## License

The model is licensed under the MIT license, but since the data used for training is largely synthetic, you should also follow the OpenAI and Google Gemini terms of service,
which means you should not use it to create models that compete with theirs.

## Acknowledgments

Thanks to [Lambda](https://lambdalabs.com/) for providing a machine to train the model.

Thanks to [ML Collective](https://mlcollective.org/) for continuous support and providing compute resources for testing the model.