@visheratin on Hugging Face: "Isn't it sad that VLMs don't have any inference parameters for the vision…"


Isn't it sad that VLMs don't have any inference parameters for the vision part? Well, MC-LLaVA now has two whole knobs you can use to make it find even the smallest details! I finally (almost) properly implemented multi-crop, and now you can control the number of crops and how many image tokens will be generated. The video shows how, by increasing the number of crops and tokens, my 3B model correctly identifies the 30x90 pixel logo in the 3200x3000 pixel image.
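To make the crop-count knob concrete, here is a minimal sketch of how an image could be tiled under a crop budget. It is not the MC-LLaVA implementation: the function name `crop_boxes`, the 384-pixel crop size, and the grid-shrinking rule are all assumptions chosen for illustration. The idea it shows is only that a larger budget lets each crop cover a smaller region, so a tiny detail survives the resize to the vision encoder's input size.

```python
from math import ceil


def crop_boxes(width, height, crop_size=384, max_crops=16):
    """Split an image into a grid of at most `max_crops` boxes.

    Hypothetical helper for illustration only; `crop_size=384` is an
    assumed encoder input size, not a value taken from the post.
    Returns a list of (left, top, right, bottom) pixel boxes.
    """
    if max_crops < 1:
        raise ValueError("max_crops must be at least 1")

    # Start from the grid that keeps every cell at or below crop_size.
    cols = max(1, ceil(width / crop_size))
    rows = max(1, ceil(height / crop_size))

    # Shrink the grid until it fits the budget, trimming the axis
    # that currently has more cells.
    while cols * rows > max_crops:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1

    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


# With a generous budget, the 3200x3000 image is cut into a 9x8 grid,
# so each cell is already close to the assumed 384-pixel input size.
many = crop_boxes(3200, 3000, max_crops=100)

# With a budget of 4, each cell is 1600x1500 and must be downscaled
# roughly 4x before encoding, which blurs small details.
few = crop_boxes(3200, 3000, max_crops=4)
```

Under these assumptions, a single crop of the whole 3200-pixel-wide image would be shrunk by a factor of about 8 to reach 384 pixels, while the 9x8 grid keeps each cell near native resolution.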
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines.
- Performance increased by 10%+ across all benchmarks.
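For the auto-class point, loading would look roughly like the sketch below. It assumes the repository registers its custom classes under `AutoModel` and `AutoProcessor` and that `trust_remote_code=True` is needed because the model code lives in the repo; the wrapper `load_mc_llava` is a hypothetical helper, and the model card is the authoritative source for the exact calls.

```python
MODEL_ID = "visheratin/MC-LLaVA-3b"


def load_mc_llava(model_id: str = MODEL_ID):
    """Load the model and processor via Transformers auto classes.

    Hypothetical convenience wrapper. The import is done lazily so the
    function can be defined without Transformers installed.
    """
    from transformers import AutoModel, AutoProcessor

    # Assumption: the repo ships custom modeling code, so remote code
    # must be explicitly trusted for the auto classes to resolve it.
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor
```

Calling `load_mc_llava()` downloads the weights, so it is best run once and the returned objects reused.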

The work is far from over, but it feels like good progress.

The model on the hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b
