Alexander Visheratin (visheratin)


visheratin's activity

posted an update 29 days ago
Yesterday, xAI announced Grok-1.5 Vision - https://x.ai/blog/grok-1.5v. But more importantly, they also released a new VLM benchmark dataset - RealWorldQA. The only problem was that they released it as a ZIP archive. I fixed that! Now you can use it in your evaluations as a regular HF dataset: visheratin/realworldqa
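If you want to plug it into an evaluation loop, it should load like any other Hub dataset. A minimal sketch, assuming the standard `datasets` API; the split name and record fields are assumptions, so check the dataset card for the actual layout:

```python
# Minimal sketch: load RealWorldQA as a regular HF dataset.
# The split name "test" and the field names are assumptions -- see the dataset card.
from datasets import load_dataset

dataset = load_dataset("visheratin/realworldqa", split="test")
print(dataset[0])  # each record should hold the image plus its question and answer
```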
posted an update about 2 months ago
replied to their post 2 months ago

It uses the same vision encoder, so I expect that nothing changes.

posted an update 2 months ago
Keep stacking cool stuff and getting better results! After I replaced the standard vision encoder with SigLIP, NLLB-CLIP got a 10% average performance improvement. And now I have added matryoshka representation learning (MRL) layers (https://arxiv.org/abs/2205.13147) to enable smaller embeddings and got another 6% performance boost! Plus, thanks to MRL, 4.5x smaller embeddings retain 90%+ of the quality (see the sketch after the model links below).

The large model is finally SoTA for both multilingual image and text retrieval!

The models are available on the hub:
- visheratin/nllb-siglip-mrl-base
- visheratin/nllb-siglip-mrl-large
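
For anyone new to MRL, this is roughly how the smaller embeddings are consumed: take the leading dimensions and re-normalize before similarity search. A minimal sketch with placeholder values; the dimensions are assumptions chosen only to match the 4.5x figure, and the random tensor stands in for a real NLLB-SigLIP embedding:

```python
# Minimal MRL sketch: truncate an embedding to its leading dimensions and re-normalize.
# full_dim/small_dim are assumed sizes (1152 / 256 = 4.5x), not the model's actual config.
import torch
import torch.nn.functional as F

full_dim, small_dim = 1152, 256

embedding = torch.randn(1, full_dim)                   # placeholder for a real embedding
small = F.normalize(embedding[:, :small_dim], dim=-1)  # 4.5x smaller, retrieval-ready
```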
replied to their post 2 months ago

I used 8x A100 80GB GPUs. With LoRA and a smaller batch size, it should be possible to train on smaller GPUs, but it is still very resource-intensive.

replied to their post 2 months ago

You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not a big problem during inference, but it makes training harder because of the gradients that have to be stored. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
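
To illustrate why memory grows, here is a hedged sketch (not the MC-LLaVA code; the encoder and shapes are placeholders): encoding N crops keeps N sets of activations around for backpropagation, while inference can discard them.

```python
# Hedged illustration with placeholder modules -- not the actual MC-LLaVA code.
# Encoding many crops keeps activations for every crop around for backprop,
# so training memory scales with the crop count; inference can discard them.
import torch
import torch.nn as nn

vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))  # stand-in ViT
crops = torch.randn(16, 3, 224, 224)  # 16 crops of a single image

features = vision_encoder(crops)  # training: activations for all 16 crops are stored

with torch.no_grad():             # inference: activations are freed as we go
    features_inference = vision_encoder(crops)
```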

replied to their post 3 months ago

There are links to existing papers in the blog post if you want to dive into the field.

replied to their post 3 months ago

I mainly used the LLaVA training codebase with some changes to support multi-crop. I'll be working on the next post, about fine-tuning MC-LLaVA on a task-specific dataset, and will open-source all the code.

posted an update 3 months ago
VLMs have a resolution problem, which prevents them from finding small details in large images. In my community blog post, I discuss ways to solve it and describe the details of the MC-LLaVA architecture - https://huggingface.co/blog/visheratin/vlm-resolution-curse

Check it out, and let me know what you think!
posted an update 3 months ago
Isn't it sad that VLMs don't have any inference parameters for the vision part? Well, MC-LLaVA now has two whole knobs you can use to make it find even the smallest details! I finally (almost) properly implemented multi-crop, and now you can control the number of crops and the number of image tokens that will be generated. The video shows how, by increasing the number of crops and tokens, my 3B model correctly identifies a 30x90-pixel logo in a 3200x3000-pixel image.
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines (see the sketch at the end of this post).
- Performance increased by 10%+ across all benchmarks.

The work is far from over, but it feels like good progress.

The model on the Hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b
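
As a quick illustration of the auto-class support mentioned above, here is a minimal sketch. `trust_remote_code=True` is an assumption (custom architectures on the Hub typically require it), and the crop/token knobs are passed via arguments documented on the model card, so their names are not guessed here:

```python
# Minimal sketch of the "two lines" via auto classes; trust_remote_code=True is an
# assumption for a custom architecture. Crop/token parameters are set through arguments
# documented on the model card (names omitted here to avoid guessing them).
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
```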