VISE: Visual Invariance Self-Evolution

This is the VISE LoRA adapter for Qwen/Qwen3-VL-2B-Instruct, from our paper "Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models".

VISE is a purely unsupervised, single-model self-evolving framework. Instead of optimizing answer agreement like prior self-evolving LMMs, it strengthens the model's visual conditioning, which is how much the decoder actually attends to the image while it generates. We train on raw, unlabeled images with no captions, bounding boxes, labels, external reward models, or specialist roles, using two invariance rewards computed from the model's own predictions:

Geometric invariance: rewards consistent localization of the same object under a known spatial transform (affine, crop, or flip).
Semantic invariance: blurs ("ghosts") the predicted region and rewards the model only if it judges the object visible before ghosting and not visible after.
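
To make the two rewards concrete, here is a minimal sketch for the horizontal-flip case. It assumes boxes are given as (x1, y1, x2, y2) pixel coordinates and that agreement is scored with IoU; the function names and the IoU choice are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def hflip_box(box: Box, image_width: float) -> Box:
    """Carry a box through a horizontal flip of an image of the given width."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_reward(box_original: Box, box_on_flipped: Box, image_width: float) -> float:
    """Compare the box predicted on the flipped image with the original
    prediction mapped through the same flip (assumed scoring: IoU)."""
    return iou(hflip_box(box_original, image_width), box_on_flipped)


def semantic_reward(visible_before: bool, visible_after: bool) -> float:
    """1.0 only if the object is judged visible before ghosting and not after."""
    return 1.0 if (visible_before and not visible_after) else 0.0
```

A prediction that moves with the flip scores 1.0 on the geometric term, while one that stays put regardless of the transform scores low. The semantic term pays out only when blurring the predicted region actually removes the object.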

We combine them as R = 0.5*R_geo + 0.5*R_sem and optimize with KL-regularized REINFORCE against a frozen reference policy.
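
As a rough illustration of this objective, the sketch below computes the combined reward and the scalar value of a KL-regularized REINFORCE loss over a batch of sampled responses. The batch-mean baseline, the `log pi - log pi_ref` KL estimate, and the `beta` coefficient are assumptions made for the example. In real training the log-probabilities would be differentiable tensors rather than floats.

```python
from typing import Sequence


def combined_reward(r_geo: float, r_sem: float,
                    w_geo: float = 0.5, w_sem: float = 0.5) -> float:
    """R = 0.5 * R_geo + 0.5 * R_sem, using the weights stated above."""
    return w_geo * r_geo + w_sem * r_sem


def reinforce_kl_loss(rewards: Sequence[float],
                      logp_policy: Sequence[float],
                      logp_reference: Sequence[float],
                      beta: float) -> float:
    """Batch-mean value of -(R - baseline) * log pi + beta * KL estimate.

    logp_policy / logp_reference are sequence log-probabilities of each
    sampled response under the trained policy and the frozen reference.
    """
    n = len(rewards)
    baseline = sum(rewards) / n  # assumed variance-reduction baseline
    total = 0.0
    for r, lp, lp_ref in zip(rewards, logp_policy, logp_reference):
        policy_term = -(r - baseline) * lp  # REINFORCE term
        kl_term = lp - lp_ref               # single-sample KL estimate (assumed)
        total += policy_term + beta * kl_term
    return total / n
```

The KL term penalizes responses the policy makes much more likely than the reference does, which keeps the adapter close to the base model while the reward term pushes toward invariant predictions.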

Usage

This is a LoRA adapter, so load the base model first and attach the adapter:

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen3-VL-2B-Instruct"
ADAPTER = "shravvvv/VISE"

# Load the base model, then attach the VISE LoRA weights on top of it.
model = AutoModelForImageTextToText.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
processor = AutoProcessor.from_pretrained(ADAPTER)
model.eval()

# Build a single-turn chat prompt with one image and one instruction.
image = Image.open("example.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (the prompt tokens are sliced off).
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Results (Qwen3-VL-2B)

Metric	Base	VISE
COCO (CIDEr)	21.54	38.39
NoCaps (CIDEr)	19.52	34.25
Flickr30k (CIDEr)	26.09	42.64
TextCaps (CIDEr)	22.20	41.86
CHAIR-I (lower is better)	13.21	8.21
CHAIR-S (lower is better)	45.96	40.51
POPE Accuracy	89.01	90.03
ScienceQA	79.42	83.61

VISE improves captioning, VQA, and reasoning scores while reducing hallucination, and no metric in the table above regresses. The same recipe generalizes to larger scales and other backbone families. Full results are in our paper.
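
The per-metric changes can be checked directly from the table. The short sketch below uses only the numbers reported above; the helper name `improved` is illustrative.

```python
# (base, VISE) scores copied from the results table above.
results = {
    "COCO (CIDEr)": (21.54, 38.39),
    "NoCaps (CIDEr)": (19.52, 34.25),
    "Flickr30k (CIDEr)": (26.09, 42.64),
    "TextCaps (CIDEr)": (22.20, 41.86),
    "CHAIR-I": (13.21, 8.21),
    "CHAIR-S": (45.96, 40.51),
    "POPE Accuracy": (89.01, 90.03),
    "ScienceQA": (79.42, 83.61),
}
# Hallucination rates, where a smaller number is better.
lower_is_better = {"CHAIR-I", "CHAIR-S"}


def improved(name: str, base: float, vise: float) -> bool:
    """True if VISE moves the metric in its 'better' direction."""
    return vise < base if name in lower_is_better else vise > base


for name, (base, vise) in results.items():
    status = "improved" if improved(name, base, vise) else "regressed"
    print(f"{name}: {vise - base:+.2f} ({status})")
```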

Training

Base: Qwen/Qwen3-VL-2B-Instruct, vision encoder frozen.
LoRA: r=16, alpha=32, dropout=0.05 on the attention, MLP, and projector layers.
Optimizer: AdamW, lr=1e-6, weight decay 0.01, gradient clipping 1.0, bfloat16.
RL: KL-regularized REINFORCE, adaptive KL (target 0.020), reward weights 0.5 / 0.5.
Data: 4,000 raw, unlabeled COCO images. No question/answer pairs or annotations.
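
For readers unfamiliar with adaptive KL, the sketch below shows one common way to adjust the KL coefficient toward a target. This model card states only the target (0.020); the proportional update rule, the clipping range, the initial coefficient, and the horizon used here are assumptions for illustration and may differ from the training code.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient so the
    observed KL drifts toward a target value (hypothetical settings)."""

    def __init__(self, init_beta: float, target_kl: float = 0.020, horizon: int = 1000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error to the target, clipped so one noisy batch
        # cannot swing the coefficient too far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        # Raise beta when KL is above target, lower it when below.
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```

With this rule the penalty tightens automatically when the policy drifts too far from the frozen reference and relaxes when it stays well within the target.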

License

Apache 2.0.

Citation

@inproceedings{venkatraman2026vise,
  title     = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
  author    = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
               Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
