VISE: Visual Invariance Self-Evolution
This is the VISE LoRA adapter for Qwen/Qwen3-VL-2B-Instruct, from our paper
"Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models".
VISE is a purely unsupervised, single-model self-evolving framework. Instead of optimizing answer agreement like prior self-evolving LMMs, it strengthens the model's visual conditioning, which is how much the decoder actually attends to the image while it generates. We train on raw, unlabeled images with no captions, bounding boxes, labels, external reward models, or specialist roles, using two invariance rewards computed from the model's own predictions:
- Geometric invariance: rewards consistent localization of the same object under a known spatial transform (affine, crop, or flip).
- Semantic invariance: blurs ("ghosts") the predicted region and rewards the model only if it judges the object visible before ghosting and not visible after.
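The two rewards can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `(x1, y1, x2, y2)` box format, the use of IoU as the consistency score, the horizontal flip as the example transform, the binary semantic reward, and all function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels; assumed format


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box into the coordinate frame of the horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_reward(box_on_original: Box, box_on_flipped: Box, image_width: float) -> float:
    """Agreement between the original prediction, mapped through the known
    flip, and the prediction the model made on the flipped image."""
    return iou(hflip_box(box_on_original, image_width), box_on_flipped)


def semantic_reward(visible_before: bool, visible_after: bool) -> float:
    """Reward only if the object is judged visible before ghosting and
    not visible after (assumed here to be a 0/1 reward)."""
    return 1.0 if visible_before and not visible_after else 0.0
```

For a 100-pixel-wide image, a box predicted at `(10, 20, 50, 80)` should reappear at `(50, 20, 90, 80)` after the flip; a model that predicts exactly that box on the flipped image gets a geometric reward of 1.0 in this sketch.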
We combine them as R = 0.5*R_geo + 0.5*R_sem and optimize with KL-regularized
REINFORCE against a frozen reference policy.
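As a rough sketch of this objective, the snippet below computes the combined reward and a per-sample KL-regularized REINFORCE loss from token log-probabilities. The baseline, the KL coefficient `beta`, the single-sample KL estimate, and the function names are assumptions made for illustration; the card does not specify them. In real training the same expression would be built from framework tensors so gradients flow through the policy log-probabilities.

```python
from typing import Sequence


def combined_reward(r_geo: float, r_sem: float,
                    w_geo: float = 0.5, w_sem: float = 0.5) -> float:
    """R = 0.5 * R_geo + 0.5 * R_sem, with the weights stated in the card."""
    return w_geo * r_geo + w_sem * r_sem


def reinforce_kl_loss(logp_policy: Sequence[float],
                      logp_reference: Sequence[float],
                      reward: float,
                      baseline: float,
                      beta: float) -> float:
    """Per-sample loss: -(R - b) * log pi(y|x) + beta * KL estimate.

    logp_policy / logp_reference hold the log-probabilities that the trained
    policy and the frozen reference assign to each sampled token.
    """
    seq_logp = sum(logp_policy)
    # Single-sample estimate of KL(policy || reference) on the sampled tokens.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return -(reward - baseline) * seq_logp + beta * kl_estimate


# Example with made-up numbers (hypothetical values, not from the paper).
reward = combined_reward(r_geo=1.0, r_sem=0.0)
loss = reinforce_kl_loss([-1.0, -2.0], [-1.5, -2.5],
                         reward=1.0, baseline=0.5, beta=0.1)
```

The KL term penalizes the policy for drifting from the frozen reference, which is what keeps the reward from being optimized at the expense of the base model's behavior.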
Usage
This is a LoRA adapter, so load the base model first and attach the adapter:
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel
BASE = "Qwen/Qwen3-VL-2B-Instruct"
ADAPTER = "shravvvv/VISE"
model = AutoModelForVision2Seq.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
processor = AutoProcessor.from_pretrained(ADAPTER)
model.eval()
image = Image.open("example.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
Results (Qwen3-VL-2B)
| Metric | Base | VISE |
|---|---|---|
| COCO (CIDEr) | 21.54 | 38.39 |
| NoCaps (CIDEr) | 19.52 | 34.25 |
| Flickr30k (CIDEr) | 26.09 | 42.64 |
| TextCaps (CIDEr) | 22.20 | 41.86 |
| CHAIR-I (lower is better) | 13.21 | 8.21 |
| CHAIR-S (lower is better) | 45.96 | 40.51 |
| POPE Accuracy | 89.01 | 90.03 |
| ScienceQA | 79.42 | 83.61 |
VISE improves captioning, VQA, and reasoning scores and reduces hallucination at the same time, with no regression on any benchmark in the table above. The same recipe generalizes to larger scales and other backbone families. Full results are in our paper.
Training
- Base: `Qwen/Qwen3-VL-2B-Instruct`, vision encoder frozen.
- LoRA: `r=16`, `alpha=32`, `dropout=0.05` on the attention, MLP, and projector layers.
- Optimizer: AdamW, `lr=1e-6`, weight decay `0.01`, gradient clipping `1.0`, bfloat16.
- RL: KL-regularized REINFORCE, adaptive KL (target `0.020`), reward weights `0.5 / 0.5`.
- Data: 4,000 raw, unlabeled COCO images. No question/answer pairs or annotations.
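The card gives the adaptive KL target (0.020) but not the update rule. One common choice in RL fine-tuning code is a clipped proportional controller that raises the KL coefficient when the measured KL exceeds the target and lowers it otherwise. The sketch below assumes that rule; the class name, `init_beta`, `horizon`, and the 0.2 clip are hypothetical and not taken from the paper.

```python
class AdaptiveKLController:
    """Clipped proportional controller for the KL coefficient (assumed rule)."""

    def __init__(self, init_beta: float, target: float = 0.020, horizon: int = 1000):
        self.beta = init_beta    # current KL coefficient
        self.target = target     # desired KL between policy and reference
        self.horizon = horizon   # larger values mean slower adaptation

    def update(self, observed_kl: float, n_steps: int = 1) -> float:
        # Relative error, clipped so a single noisy batch cannot swing beta too far.
        error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta


controller = AdaptiveKLController(init_beta=0.1, target=0.020, horizon=100)
beta_after_high_kl = controller.update(observed_kl=0.040)  # KL above target: beta rises
beta_after_low_kl = controller.update(observed_kl=0.0)     # KL below target: beta falls
```

The updated coefficient would then multiply the KL term of the REINFORCE loss on the next optimization step.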
License
Apache 2.0.
Citation
@inproceedings{venkatraman2026vise,
  title     = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
  author    = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
               Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}