Imaginative Perception Token — MVC (Mixed)

A unified vision-language model (VLM) trained with Imaginative Perception Tokens (IPT) for the Multiview Counting (MVC) spatial-reasoning task. It uses Mixed training: a 50/50 mix of imaginative and answer-only data. Released with the paper:

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models — arXiv:2606.03988

💻 Training code: https://github.com/weikaih04/Imaginative-Perception-Token
🧪 Evaluation: https://github.com/weikaih04/Imaginative-Perception-Token-Eval
🤗 Data: https://huggingface.co/collections/weikaih/imaginative-perception-token-data-6a15f80e0fcef43bd0c50aba
🧠 Built on ThinkMorph-7B / BAGEL-7B-MoT

Two inference modes

Because this is a Mixed model (trained on both imaginative and answer-only data), it supports two inference modes, selected by the system prompt:

1. Answer-only (zero-shot, fast)

Answers directly — no image is generated at inference.

ThinkMorph(
    model_path="weikaih/imaginative-perception-token-mvc-mixed",
    think=False, understanding_output=True, visual_gen=True, vae_input=True,
)
# system prompt: "Answer the question directly ... Do not think or generate any images."
# output: <answer>X</answer>

2. Imaginative (Visual CoT, with image)

Generates an Imaginative Perception Token (an intermediate image of what it would perceive) before answering.

ThinkMorph(
    model_path="weikaih/imaginative-perception-token-mvc-mixed",
    think=True, understanding_output=False, visual_gen=True, save_dir="./imgs",
)
# system prompt: "Let's think step by step ... <think> ... <image_start> ... </image_end> ... <answer> ... </answer>"
# output: <think>...</think><image_start>[generated image]<image_end><answer>X</answer>

See the evaluation repo for the full inference wrapper and prompts.
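
Both modes end with an `<answer>` tag, so downstream scoring only needs that tag's content. The following is a minimal sketch of a parser for the two output formats shown above. `ParsedOutput` and `parse_output` are hypothetical helper names, not part of the evaluation repo, and the regular expressions assume well-formed tags.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    """Fields recovered from one raw model output string (hypothetical helper type)."""
    think: Optional[str]   # text inside <think>...</think>, or None in answer-only mode
    has_image: bool        # True if an <image_start> marker appears in the output
    answer: Optional[str]  # text inside the last <answer>...</answer>, or None if missing


_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_output(text: str) -> ParsedOutput:
    """Parse either output format; assumes the tag layout shown in this card."""
    answers = _ANSWER_RE.findall(text)
    think = _THINK_RE.search(text)
    return ParsedOutput(
        think=think.group(1).strip() if think else None,
        has_image="<image_start>" in text,
        # Take the last <answer> tag in case the reasoning text mentions one earlier.
        answer=answers[-1].strip() if answers else None,
    )
```

For an answer-only output such as `<answer>3</answer>`, `parse_output(...).answer` is `"3"` and `think` is `None`. An output with no `<answer>` tag yields `answer=None`, which a scorer can count as incorrect.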

Performance (MVC, Mixed)

AI2-THOR	ScanNet	MessyTable	MindCube	All-Angles
62.3	47.0	37.0	37.0	33.5

Values are accuracy (%). The paper reports the higher of the two results from answer-only and free-generation inference.
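
The reporting rule can be sketched as follows: score each inference mode separately, then keep the higher accuracy. This is an illustration under assumptions, not the paper's evaluation script. The names `accuracy` and `reported_accuracy` are hypothetical, the predictions are toy data, exact string match is assumed as the scoring criterion, and the max is assumed to be taken per benchmark.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Exact-match accuracy in percent (assumed scoring criterion)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return 100.0 * correct / len(golds)


def reported_accuracy(
    answer_only_preds: Sequence[str],
    free_generation_preds: Sequence[str],
    golds: Sequence[str],
) -> float:
    """Return the higher of the two per-mode accuracies for one benchmark."""
    return max(
        accuracy(answer_only_preds, golds),
        accuracy(free_generation_preds, golds),
    )


# Toy data for illustration only; these are not results from the paper.
golds = ["3", "2", "5", "1"]
answer_only = ["3", "2", "4", "0"]      # 2 of 4 correct
free_generation = ["3", "2", "5", "0"]  # 3 of 4 correct
print(reported_accuracy(answer_only, free_generation, golds))  # 75.0
```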

Citation

@misc{bigverdi2026imaginativeperceptiontokensenhance,
      title={Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models},
      author={Mahtab Bigverdi and Linjie Li and Weikai Huang and Yiming Liu and Jaemin Cho and Jieyu Zhang and Tuhin Kundu and Chris Dangjoo Kim and Zelun Luo and Linda Shapiro and Ranjay Krishna},
      year={2026},
      eprint={2606.03988},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.03988},
}