visheratin committed
Commit 06b6855
1 Parent(s): 92d6894

Update README.md

Files changed (1)
  1. README.md +20 -12
README.md CHANGED
@@ -16,7 +16,7 @@ widget:
     src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
 ---
 
-# LLaVA-3b
+# Multi-crop LLaVA-3b
 
 <a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
@@ -24,13 +24,16 @@ widget:
 
 ## Model details
 
-LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in a LLaVA fashion using vision tower from
-[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). There are a couple of things different from the original LLaVA architecture:
+The core idea behind multi-crop LLaVA is that instead of generating N visual token embeddings from the whole image, I generate one token embedding for each of N parts (crops) of the image.
+Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
 
-1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for images. The idea is that using more tokens
-allows us to get more info from the image into the language model.
-2. The model uses the output from the latest layer of the vision encoder instead of the intermediate one.
-3. The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
+For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
+gives token embeddings of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image; I plan to add it later.
+
+MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using the vision tower from
+[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
+
+The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
 
 Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:
 
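To make the crop-and-project flow described in the added lines above concrete, here is a minimal runnable sketch of the idea. `DummySigLIPEncoder`, `DummyLLaVAAdapter`, and `grid_crops` are illustrative stand-ins introduced for this example, not the model's actual module names, and the fixed 2x2 grid is an assumed crop strategy.

```python
import torch
import torch.nn as nn

# Stand-ins for the real components (assumptions, not the actual MC-LLaVA
# modules): a SigLIP-like encoder that maps one image crop to a [1, 1152]
# embedding, and a LLaVA-style adapter that maps each 1152-dim crop
# embedding to a 2560-dim token embedding for the language model.
class DummySigLIPEncoder(nn.Module):
    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: [3, H, W] -> pooled embedding [1, 1152]
        return torch.randn(1, 1152)

class DummyLLaVAAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1152, 2560)

    def forward(self, crop_embeddings: torch.Tensor) -> torch.Tensor:
        # crop_embeddings: [N, 1152] -> visual tokens [N, 2560]
        return self.proj(crop_embeddings)

def grid_crops(image: torch.Tensor, grid: int = 2) -> list:
    # Split a [3, H, W] image into grid x grid equal crops (N = grid ** 2).
    _, h, w = image.shape
    ch, cw = h // grid, w // grid
    return [image[:, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            for i in range(grid) for j in range(grid)]

encoder, adapter = DummySigLIPEncoder(), DummyLLaVAAdapter()
image = torch.rand(3, 384, 384)

crops = grid_crops(image, grid=2)                   # N = 4 crops
crop_embs = torch.cat([encoder(c) for c in crops])  # [N, 1152]
visual_tokens = adapter(crop_embs)                  # [N, 2560]
print(visual_tokens.shape)                          # torch.Size([4, 2560])
```

As the added text notes, nothing in this flow encodes where each crop sits in the original image; that positional information is planned but not yet present.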
@@ -115,15 +118,20 @@ inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
 **Generate the data**
 
 ```python
-output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
+import torch
+
+with torch.inference_mode():
+    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
 ```
 
 ## Benchmarks
 
-- TextVQA - 33.25%
-- GQA - 47.15%
-- VQAv2 - 63.1%
-- VizWiz - 24.03%
+- TextVQA - 38.59%
+- GQA - 49.6%
+- VQAv2 - 64.24%
+- VizWiz - 24.88%
+- POPE - 80.59%
+- V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%)
 
 ## License
 
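As a follow-up to the generation step in the last hunk, here is a hedged sketch of what the ChatML prompt mentioned above could look like and how the generated ids could be decoded. The system and user strings, the `<image>` placeholder, and the reuse of `inputs`, `output`, and `tokenizer` from the model card's earlier code are assumptions for illustration, not the card's verbatim instructions.

```python
# Illustrative ChatML layout (assumed wording and <image> placeholder).
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\nWhat is shown in this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Decode only the newly generated ids; for causal LMs, Hugging Face
# generate() typically returns the prompt ids followed by the new tokens.
new_ids = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_ids, skip_special_tokens=True))
```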