VisReason-Pro-Qwen2.5-VL-7B

The main VisReason model from our ECCV 2026 paper. It builds on VisReason-Qwen2.5-VL-7B and is further trained on VisReason-Pro, the high-fidelity subset (~165K, the GQA portion) produced under a stronger GPT-4.1-series annotator with depth-informed 3D grounding. The additional training strengthens spatially grounded, multi-round visual Chain-of-Thought reasoning over small objects and complex 2D/3D relations.

This checkpoint is the primary model evaluated across our benchmark suite (fine-grained grounding, multi-round visual CoT, MME, POPE, V*).

Training

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct
  • Method: LoRA supervised fine-tuning, continued from the VisReason base model (VisReason-Qwen2.5-VL-7B) and further trained on the VisReason-Pro subset; the LoRA weights are merged into the base weights
  • Data: VisReason + VisReason-Pro (depth-grounded GQA subset)
  • Framework: LLaMA-Factory

Usage

The model is trained in a tool-calling chat format. It wraps its reasoning in <think>...</think>, optionally emits a single image_zoom_in_tool call whose bbox_2d is ratio-based ([x1,y1,x2,y2], each coordinate in [0,1]) to crop the current view, and gives the final answer in <answer>...</answer>. Load it with transformers (Qwen2_5_VLForConditionalGeneration) or serve it with vLLM, using the standard Qwen2.5-VL processor.
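
The output format described above can be post-processed with a few lines of plain Python. The following is a minimal sketch, not the authors' code: the helper names parse_response and ratio_bbox_to_pixels are hypothetical, and the rounding policy (floor for the top-left corner, ceil for the bottom-right) is an assumption made so that the crop never loses pixels from the requested region. How the image_zoom_in_tool call itself is serialized is not specified here, so the sketch starts from an already-extracted bbox_2d list.

```python
import math
import re
from typing import Dict, Optional, Sequence, Tuple


def parse_response(text: str) -> Dict[str, Optional[str]]:
    """Extract the <think> and <answer> spans from a model response.

    Hypothetical helper: returns None for a tag that is absent.
    """
    def grab(tag: str) -> Optional[str]:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return {"think": grab("think"), "answer": grab("answer")}


def ratio_bbox_to_pixels(
    bbox_2d: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a ratio-based [x1, y1, x2, y2] box in [0, 1] to pixel coordinates.

    Assumption: values are clamped to [0, 1], the top-left corner is floored
    and the bottom-right corner is ceiled.
    """
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox_2d)
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate box: {list(bbox_2d)}")
    left = math.floor(x1 * width)
    top = math.floor(y1 * height)
    right = math.ceil(x2 * width)
    bottom = math.ceil(y2 * height)
    return left, top, right, bottom


if __name__ == "__main__":
    reply = "<think>The sign is small.</think><answer>stop</answer>"
    print(parse_response(reply))
    print(ratio_bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=640, height=480))
```

The returned tuple uses the (left, upper, right, lower) ordering that Pillow's Image.crop expects, so it can be passed to crop directly to produce the zoomed view for the next round.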

Citation

@inproceedings{visreason2026,
  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Model size: 8B params · Tensor type: BF16 (Safetensors)