visheratin committed
Commit 06b6855
1 Parent(s): 92d6894

Update README.md

Files changed (1)
  1. README.md +20 -12
README.md CHANGED
@@ -16,7 +16,7 @@ widget:
     src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
 ---
 
-# LLaVA-3b
+# Multi-crop LLaVA-3b
 
 <a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
@@ -24,13 +24,16 @@ widget:
 
 ## Model details
 
-LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in a LLaVA fashion using vision tower from
-[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). There are a couple of things different from the original LLaVA architecture:
+The core idea behind multi-crop LLaVA is that instead of generating N visual token embeddings from the whole image, I generate one token embedding for each of N parts (crops) of the image.
+Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
 
-1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for images. The idea is that using more tokens
-allows us to get more info from the image into the language model.
-2. The model uses the output from the latest layer of the vision encoder instead of the intermediate one.
-3. The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
+For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
+gives token embeddings of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image; I plan to add it later.
+
+MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using the vision tower from
+[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
+
+The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
 
 Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:
 
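To make the crop-and-project flow described in the added lines above concrete, here is a minimal runnable sketch of the idea. `DummySigLIPEncoder`, `DummyLLaVAAdapter`, and `grid_crops` are illustrative stand-ins introduced for this example, not the model's actual module names, and the fixed 2x2 grid is an assumed crop strategy.

```python
import torch
import torch.nn as nn

# Stand-ins for the real components (assumptions, not the actual MC-LLaVA
# modules): a SigLIP-like encoder that maps one image crop to a [1, 1152]
# embedding, and a LLaVA-style adapter that maps each 1152-dim crop
# embedding to a 2560-dim token embedding for the language model.
class DummySigLIPEncoder(nn.Module):
    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: [3, H, W] -> pooled embedding [1, 1152]
        return torch.randn(1, 1152)

class DummyLLaVAAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1152, 2560)

    def forward(self, crop_embeddings: torch.Tensor) -> torch.Tensor:
        # crop_embeddings: [N, 1152] -> visual tokens [N, 2560]
        return self.proj(crop_embeddings)

def grid_crops(image: torch.Tensor, grid: int = 2) -> list:
    # Split a [3, H, W] image into grid x grid equal crops (N = grid ** 2).
    _, h, w = image.shape
    ch, cw = h // grid, w // grid
    return [image[:, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            for i in range(grid) for j in range(grid)]

encoder, adapter = DummySigLIPEncoder(), DummyLLaVAAdapter()
image = torch.rand(3, 384, 384)

crops = grid_crops(image, grid=2)                   # N = 4 crops
crop_embs = torch.cat([encoder(c) for c in crops])  # [N, 1152]
visual_tokens = adapter(crop_embs)                  # [N, 2560]
print(visual_tokens.shape)                          # torch.Size([4, 2560])
```

As the added text notes, nothing in this flow encodes where each crop sits in the original image; that positional information is planned but not yet present.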
@@ -115,15 +118,20 @@ inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
 **Generate the data**
 
 ```python
-output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
+import torch
+
+with torch.inference_mode():
+    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
 ```
 
 ## Benchmarks
 
-- TextVQA - 33.25%
-- GQA - 47.15%
-- VQAv2 - 63.1%
-- VizWiz - 24.03%
+- TextVQA - 38.59%
+- GQA - 49.6%
+- VQAv2 - 64.24%
+- VizWiz - 24.88%
+- POPE - 80.59%
+- V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%)
 
 ## License
 
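As a follow-up to the generation step in the last hunk, here is a hedged sketch of what the ChatML prompt mentioned above could look like and how the generated ids could be decoded. The system and user strings, the `<image>` placeholder, and the reuse of `inputs`, `output`, and `tokenizer` from the model card's earlier code are assumptions for illustration, not the card's verbatim instructions.

```python
# Illustrative ChatML layout (assumed wording and <image> placeholder).
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\nWhat is shown in this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Decode only the newly generated ids; for causal LMs, Hugging Face
# generate() typically returns the prompt ids followed by the new tokens.
new_ids = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_ids, skip_special_tokens=True))
```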