VisReason-Qwen2.5-VL-7B

Qwen2.5-VL-7B-Instruct fine-tuned on the VisReason dataset to perform human-like, global-to-local visual Chain-of-Thought reasoning: the model forms a holistic hypothesis, then iteratively zooms into salient regions (areas of interest) to gather fine-grained visual evidence before producing a grounded final answer.

This is the base VisReason model (the baseline checkpoint in our experiments) used in the ECCV 2026 paper. For the depth-grounded variant with stronger spatial reasoning, see VisReason-Pro-Qwen2.5-VL-7B.

Training

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct
  • Method: LoRA supervised fine-tuning (2 epochs), then merged into the base weights
  • Data: VisReason training set (~489K multi-round visual-CoT examples)
  • Framework: LLaMA-Factory
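
For readers unfamiliar with the "merged into the base weights" step, the following is a sketch of what merging a LoRA adapter means under the standard LoRA parameterization. The rank r, the scaling factor α, and the set of adapted layers are not stated in this card, so the symbols below are placeholders rather than the values used in training.

```latex
% Standard LoRA merge for one adapted weight matrix (assumed parameterization).
% W_0: frozen base weight, A and B: trained low-rank factors,
% r: adapter rank, \alpha: LoRA scaling hyperparameter.
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k}
```

After merging, the checkpoint has the same architecture and parameter count as the base model, so it can be loaded without any adapter library.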

Usage

The model is trained in a tool-calling chat format. It wraps its reasoning in <think>...</think>, optionally emits a single image_zoom_in_tool call whose bbox_2d argument gives the region of the current view to crop as ratio coordinates ([x1, y1, x2, y2], each in [0, 1]), and outputs the final answer in <answer>...</answer>. Load the model with transformers (Qwen2_5_VLForConditionalGeneration) or serve it with vLLM, using the standard Qwen2.5-VL processor.
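
The output format above can be handled on the client side roughly as follows. This is a minimal sketch, not the authors' inference code: it assumes the tool call is serialized Qwen-style as JSON inside <tool_call>...</tool_call> with "name" and "arguments" keys, which this card does not confirm. The helper names parse_turn and ratio_bbox_to_pixels are hypothetical.

```python
import json
import math
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Assumption: tool calls are wrapped as <tool_call>{json}</tool_call>.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_turn(text):
    """Split one model turn into reasoning, zoom request, and final answer."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    tool = TOOL_RE.search(text)

    bbox = None
    if tool:
        call = json.loads(tool.group(1))
        # Assumed JSON layout: {"name": ..., "arguments": {"bbox_2d": [...]}}
        if call.get("name") == "image_zoom_in_tool":
            bbox = call.get("arguments", {}).get("bbox_2d")

    return {
        "think": think.group(1).strip() if think else None,
        "bbox_2d": bbox,
        "answer": answer.group(1).strip() if answer else None,
    }


def ratio_bbox_to_pixels(bbox, width, height):
    """Convert [x1, y1, x2, y2] ratios in [0, 1] to a pixel crop box."""
    # Clamp each ratio to [0, 1] in case the model slightly overshoots.
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox)
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate bbox: {bbox}")
    # Floor the top-left corner and ceil the bottom-right corner so the
    # crop never loses pixels that the ratio box partially covers.
    left, top = math.floor(x1 * width), math.floor(y1 * height)
    right, bottom = math.ceil(x2 * width), math.ceil(y2 * height)
    return left, top, right, bottom
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's Image.crop accepts. In a multi-round loop, the cropped view would be sent back as the next image input until the turn contains an <answer> block.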

Citation

@inproceedings{visreason2026,
  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}