ZejunLi committed on
Commit 9427d4a · verified · 1 Parent(s): dfb93a4

Update README.md

Files changed (1)
  1. README.md +86 -3
README.md CHANGED
@@ -1,3 +1,86 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Mixture-of-Visual-Thoughts
+
+ AdaVaR-3B/7B is our adaptive visual reasoning model, which can reason in two thinking modes:
+
+ 1. Text-based reasoning: express the reasoning directly in natural language;
+ 2. Grounded reasoning: align the reasoning process with the image through coordinates (typically object bounding boxes).
+
+ For a more detailed introduction, please visit:
+
+ - Our GitHub Repo: [Mixture-of-Visual-Thoughts]()
+ - Our Paper: https://arxiv.org/pdf/2509.22746
+
+ ## Quick Usage of AdaVaR
+ Our AdaVaR-3B/7B models are based on Qwen2.5-VL-3B/7B, so you can use them the same way as Qwen2.5-VL: just modify the system prompt and append a post prompt to the query.
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from constants import R1_SYSTEM_PROMPT_ADAPT_v2, POST_PROMPT_ADAPT_v2
+ import torch
+ from qwen_vl_utils import process_vision_info
+
+ # loading the model and processor
+ model_path = "ZejunLi/AdaVaR-3B"
+ device = torch.device("cuda")
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)
+ processor = AutoProcessor.from_pretrained(model_path)
+
+ # construct input messages
+ image = "./assets/vstar.jpg"
+ query = "Is the dog on the left or right side of the bicycle? (A) right; (B) left. Please answer the question with the correct option letter, e.g., A, B, C, D."
+
+ messages = [
+     {"role": "system", "content": R1_SYSTEM_PROMPT_ADAPT_v2},
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": image,
+             },
+             {"type": "text", "text": query + " " + POST_PROMPT_ADAPT_v2},
+         ],
+     }
+ ]
+
+ # process model inputs
+ image_inputs, _ = process_vision_info(messages)
+ query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ input_dict = {k: v.to(device) for k, v in processor(text=[query], images=image_inputs, padding=True, return_tensors="pt").items()}
+
+ # generate model responses
+ output = model.generate(**input_dict, use_cache=True, do_sample=False, max_new_tokens=2048)
+ output_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(input_dict['input_ids'], output)]
+ response = processor.tokenizer.batch_decode(output_trimmed)[0]
+ print(response)
+ ```
+ Note: the sample image is provided in our GitHub repo.
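
The message layout used in the quick-usage snippet (a system prompt, then a user turn whose text is the query followed by a post prompt) can be wrapped in a small helper. This is only a sketch: `build_messages` is a hypothetical name that is not part of the repository, and the prompt strings are passed in as arguments because their actual contents live in `constants`.

```python
from typing import Any, Dict, List


def build_messages(
    image: str, query: str, system_prompt: str, post_prompt: str
) -> List[Dict[str, Any]]:
    """Build chat messages in the same layout as the quick-usage snippet.

    Hypothetical helper for illustration; not part of the AdaVaR repository.
    """
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                # The post prompt is appended to the query, separated by one space.
                {"type": "text", "text": query + " " + post_prompt},
            ],
        },
    ]


# Placeholder strings for illustration only; the real prompts are
# R1_SYSTEM_PROMPT_ADAPT_v2 and POST_PROMPT_ADAPT_v2 from `constants`.
demo = build_messages(
    "./assets/vstar.jpg", "Where is the dog?", "SYSTEM PROMPT", "POST PROMPT"
)
print(demo[1]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly like the hand-written `messages` list.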
+
+ By default, AdaVaR adaptively chooses an appropriate mode. Users can force a specific mode by fixing the mode prefix token:
+ ```python
+ # visually-grounded mode
+ grd_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<grounding>"
+
+ # text-based mode
+ txt_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<text>"
+ ```
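
The mode selection above can be factored into two small string helpers, sketched below. `with_mode` and `detect_mode` are hypothetical names that are not part of the repository; only the `<grounding>` and `<text>` tokens come from the README. `detect_mode` additionally assumes that, in adaptive mode, the decoded response begins with the mode prefix token, which requires decoding without skipping special tokens.

```python
from typing import Optional

# Mode prefix tokens as given in the README.
MODE_PREFIXES = {"grounded": "<grounding>", "text": "<text>"}


def with_mode(chat_prompt: str, mode: Optional[str] = None) -> str:
    """Append a mode prefix token to a templated prompt.

    mode=None leaves the prompt unchanged, so the model picks the mode itself.
    """
    if mode is None:
        return chat_prompt
    if mode not in MODE_PREFIXES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_PREFIXES)}")
    return chat_prompt + MODE_PREFIXES[mode]


def detect_mode(response: str) -> Optional[str]:
    """Guess which mode an adaptive response used.

    Assumption: the response starts with the mode prefix token.
    Returns None when no known prefix is found.
    """
    stripped = response.lstrip()
    for name, token in MODE_PREFIXES.items():
        if stripped.startswith(token):
            return name
    return None


print(with_mode("PROMPT", "grounded"))
print(detect_mode("<text> The dog is on the left."))
```

When a mode is forced with `with_mode`, the prefix token is part of the prompt rather than the generated text, so the trimmed response will not start with it and `detect_mode` is not needed.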
+
+ ## Citation
+
+ If you find our code, model, or data helpful for your work, please consider citing:
+
+ ```bibtex
+ @article{li2025mixture,
+   title={Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning},
+   author={Li, Zejun and Zhao, Yingxiu and Zhang, Jiwen and Wang, Siyuan and Yao, Yang and Zhao, Runzhou and Song, Jun and Zheng, Bo and Wei, Zhongyu},
+   journal={arXiv preprint arXiv:2509.22746},
+   year={2025}
+ }
+ ```