👀 PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Project Page | GitHub | arXiv | Pinpoint-Bench

PixelEyes enhances active visual search in MLLMs by delegating fine-grained localization to a specialized perception tool, thereby achieving efficient and accurate multi-turn visual reasoning.
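
As a rough illustration of this decoupling, the sketch below separates a reasoner that decides what to look for from a perception tool that returns pixel coordinates. It is not PixelEyes' actual interface: `Step`, `run_agent`, `toy_reasoner`, `toy_locate`, and the box values are hypothetical stand-ins, and both components are stubbed with plain functions rather than models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Evidence = List[Tuple[str, Box]]  # (query, located box) pairs gathered so far


@dataclass
class Step:
    """One reasoner decision: ask the perception tool, or give a final answer."""
    query: Optional[str] = None   # what to localize next
    answer: Optional[str] = None  # a final answer ends the loop


def run_agent(
    question: str,
    reasoner: Callable[[str, Evidence], Step],
    locate: Callable[[str], Box],
    max_turns: int = 4,
) -> Tuple[Optional[str], Evidence]:
    """Multi-turn loop: the reasoner never localizes, the tool never reasons."""
    evidence: Evidence = []
    for _ in range(max_turns):
        step = reasoner(question, evidence)
        if step.answer is not None:
            return step.answer, evidence
        # Fine-grained localization is delegated to the perception tool.
        evidence.append((step.query, locate(step.query)))
    return None, evidence  # turn budget exhausted without an answer


def toy_reasoner(question: str, evidence: Evidence) -> Step:
    """Stub reasoner: request one region, then answer from it."""
    if not evidence:
        return Step(query="street sign")
    return Step(answer=f"evidence found at {evidence[-1][1]}")


def toy_locate(query: str) -> Box:
    """Stub perception tool: a fixed lookup instead of a real detector."""
    return {"street sign": (120, 40, 180, 90)}.get(query, (0, 0, 0, 0))
```

Calling `run_agent("What does the sign say?", toy_reasoner, toy_locate)` takes two turns: one tool call that records the box for "street sign", then a final answer built from that evidence.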


This repository contains the weights for PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception, introduced in the paper "PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking".
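
A minimal sketch for fetching the weights locally, assuming the `huggingface_hub` package is installed and that the repository id is `godx7/PixelEyes-4B` as shown on this model page. The `checkpoints` folder and both helper names are hypothetical choices for illustration. How to load and run the model afterwards is covered in the GitHub repository and is not shown here.

```python
from pathlib import Path

REPO_ID = "godx7/PixelEyes-4B"  # repository id as listed on this model page


def default_local_dir(repo_id: str = REPO_ID, root: str = "checkpoints") -> Path:
    """Map a repo id to a local folder, e.g. checkpoints/PixelEyes-4B."""
    return Path(root) / repo_id.split("/")[-1]


def download_weights(repo_id: str = REPO_ID, root: str = "checkpoints") -> Path:
    """Download every file in the repository and return the local folder."""
    # Imported lazily so default_local_dir works without the package installed.
    from huggingface_hub import snapshot_download

    target = default_local_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Running `download_weights()` needs network access and places the files under `checkpoints/PixelEyes-4B`.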

For more details on environment setup, training, and evaluation, please visit the GitHub repository.

Citation

If you find this project helpful in your research, please cite our paper:

@misc{gong2026pixeleyesdecouplingperceptionreasoning,
      title={PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking}, 
      author={Dengxian Gong and Yuanzheng Wu and Haobo Yuan and Zhengdong Hu and Tao Zhang and Yikang Zhou and Shihao Chen and Quanzhu Niu and Kai Wang and Jason Li and Haochen Wang and Lu Qi and Shunping Ji and Ming-Hsuan Yang},
      year={2026},
      eprint={2607.00115},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2607.00115}, 
}

Model ID: godx7/PixelEyes-4B
Format: Safetensors
Model size: 5B params
Tensor type: BF16