Imaginative Perception Token — MVC (Mixed)
A unified VLM trained with Imaginative Perception Tokens (IPT) for the Multiview Counting (MVC) spatial-reasoning task, using Mixed training (50/50 imaginative + answer-only). Released with the paper:
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models — arXiv:2606.03988
- 💻 Training code: https://github.com/weikaih04/Imaginative-Perception-Token
- 🧪 Evaluation: https://github.com/weikaih04/Imaginative-Perception-Token-Eval
- 🤗 Data: https://huggingface.co/collections/weikaih/imaginative-perception-token-data-6a15f80e0fcef43bd0c50aba
- 🧠 Built on ThinkMorph-7B/BAGEL-7B-MoT
Two inference modes
Because this is a Mixed model (trained on both imaginative and answer-only data), it supports two inference modes, selected by the system prompt:
1. Answer-only (zero-shot, fast)
Answers directly — no image is generated at inference.
ThinkMorph(
model_path="weikaih/imaginative-perception-token-mvc-mixed",
think=False, understanding_output=True, visual_gen=True, vae_input=True,
)
# system prompt: "Answer the question directly ... Do not think or generate any images."
# output: <answer>X</answer>
2. Imaginative (Visual CoT, with image)
Generates an Imaginative Perception Token (an intermediate image of what it would perceive) before answering.
ThinkMorph(
model_path="weikaih/imaginative-perception-token-mvc-mixed",
think=True, understanding_output=False, visual_gen=True, save_dir="./imgs",
)
# system prompt: "Let's think step by step ... <think> ... <image_start> ... </image_end> ... <answer> ... </answer>"
# output: <think>...</think><image_start>[generated image]<image_end><answer>X</answer>
See the evaluation repo for the full inference wrapper and prompts.
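Both modes end their output with an `<answer>...</answer>` span, so one parser can serve either mode. The following is a minimal sketch of such a parser, not the evaluation repo's own code: the helper name `extract_answer` and the sample strings are illustrative assumptions, and only the tag layout is taken from the output formats shown above.

```python
import re
from typing import Optional

# Matches the text between <answer> and </answer>; DOTALL lets the answer span lines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> span, or None if absent.

    Hypothetical helper: the name and the "take the last match" policy are
    assumptions made for this sketch.
    """
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None


# Illustrative strings shaped like the two documented output formats.
answer_only_output = "<answer>3</answer>"
imaginative_output = (
    "<think>count the chairs across views</think>"
    "<image_start>[generated image]<image_end>"
    "<answer>4</answer>"
)

print(extract_answer(answer_only_output))
print(extract_answer(imaginative_output))
print(extract_answer("no tags here"))
```

Taking the last match is a defensive choice in case the reasoning text itself mentions an answer tag; returning `None` lets the caller decide how to score malformed outputs.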
Performance (MVC, Mixed)
| AI2-THOR | ScanNet | MessyTable | MindCube | All-Angles |
|---|---|---|---|---|
| 62.3 | 47.0 | 37.0 | 37.0 | 33.5 |
Accuracy (%); the paper reports the max of answer-only and free-generation inference.
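The "max of two inference modes" convention can be sketched as below. This assumes the maximum is taken separately for each benchmark, which the note above does not spell out. The function name `best_of_modes` is hypothetical, and the numbers in the example are toy placeholders, not results from the paper.

```python
from typing import Dict


def best_of_modes(
    answer_only: Dict[str, float], imaginative: Dict[str, float]
) -> Dict[str, float]:
    """Return, for each benchmark, the higher accuracy of the two modes.

    Assumption for this sketch: both dicts map benchmark name -> accuracy (%)
    and cover the same benchmarks.
    """
    if answer_only.keys() != imaginative.keys():
        raise ValueError("both modes must report the same benchmarks")
    return {name: max(answer_only[name], imaginative[name]) for name in answer_only}


# Toy placeholder values, NOT measured results.
toy_answer_only = {"bench_a": 40.0, "bench_b": 55.5}
toy_imaginative = {"bench_a": 42.5, "bench_b": 50.0}

print(best_of_modes(toy_answer_only, toy_imaginative))
```

Under this reading, a reported cell says only that at least one mode reached that accuracy, not which mode did.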
Citation
@misc{bigverdi2026imaginativeperceptiontokensenhance,
title={Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models},
author={Mahtab Bigverdi and Linjie Li and Weikai Huang and Yiming Liu and Jaemin Cho and Jieyu Zhang and Tuhin Kundu and Chris Dangjoo Kim and Zelun Luo and Linda Shapiro and Ranjay Krishna},
year={2026},
eprint={2606.03988},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.03988},
}
Model tree for weikaih/imaginative-perception-token-mvc-mixed
Base model: Qwen/Qwen2.5-7B