visheratin committed
Commit 06bc212
Parent: a1d449e

Update README.md

Files changed (1): README.md (+2, -0)
README.md CHANGED
@@ -28,6 +28,8 @@ Usually, in LLaVA models, we generate N embeddings for the image, which we then
 for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the
 number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).
 
+You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).
+
 MC-LLaVA-3b was fine-tuned from [Phi-2 merge](vince62s/phi-2-psy) using vision tower from
 [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).
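The context lines above summarize the idea behind MC-LLaVA: instead of spending N embeddings on a single encoding of the full image, spend K << N tokens on each of M crops. The sketch below only illustrates that token-budget trade-off; `make_crops`, the grid layout, and the values of N, K, and M are assumptions for illustration, not the model's actual code or configuration.

```python
from PIL import Image

def make_crops(image: Image.Image, grid: int = 2, crop_size: int = 384):
    """Split an image into a grid x grid set of crops plus one global view,
    each resized to the vision tower's input size (hypothetical helper)."""
    w, h = image.size
    crops = [image.resize((crop_size, crop_size))]  # global view of the whole image
    for row in range(grid):
        for col in range(grid):
            box = (col * w // grid, row * h // grid,
                   (col + 1) * w // grid, (row + 1) * h // grid)
            crops.append(image.crop(box).resize((crop_size, crop_size)))
    return crops  # M = grid * grid + 1 crops

# Token-budget comparison with assumed numbers (not the model's actual values):
# a plain LLaVA-style model spends N tokens on one full-image encoding,
# while the multi-crop approach spends K << N tokens on each of M crops.
N = 729   # assumed embedding count for one full-image encoding
K = 32    # assumed compressed tokens kept per crop
M = len(make_crops(Image.new("RGB", (1024, 768))))  # 5 crops with grid=2
print(f"single-encoding tokens: {N}, multi-crop tokens: {M * K}")
```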